How to parse multiple dates from a block of text in Python (or another language)

一向 2021-01-01 17:43

I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best thing I've found so far is dateutil.

4 answers
  • 2021-01-01 18:28

    Looking at it, the least hacky way would be to modify dateutil parser to have a fuzzy-multiple option.

    parser._parse takes your string, tokenizes it with _timelex and then compares the tokens with data defined in parserinfo.

    Here, if a token doesn't match anything in parserinfo, the parse will fail unless fuzzy is True.

    What I suggest is that you allow non-matches while you don't have any processed time tokens; then, when you hit a non-match after collecting some, process the parsed data at that point and start looking for time tokens again.

    Shouldn't take too much effort.


    Update

    While you're waiting for your patch to get rolled in...

    This is a little hacky and uses non-public functions in the library, but it doesn't require modifying the library and is not trial-and-error. You might get false positives if the text contains lone tokens that can be converted to floats, so you might need to filter the results some more.

    from dateutil.parser import _timelex, parser
    
    a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
    
    p = parser()
    info = p.info
    
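    def timetoken(token):
        """Return True if the token is a number or a word dateutil knows."""
        try:
            float(token)
            return True
        except ValueError:
            pass
        lookups = (info.jump, info.weekday, info.month, info.hms, info.ampm,
                   info.pertain, info.utczone, info.tzoffset)
        return any(f(token) for f in lookups)
    
    def timesplit(input_string):
        """Yield runs of consecutive time tokens, joined by spaces."""
        batch = []
        for token in _timelex(input_string):
            if timetoken(token):
                if info.jump(token):
                    continue  # skip separators and filler words such as "on"
                batch.append(token)
            else:
                if batch:
                    yield " ".join(batch)
                    batch = []
        if batch:
            yield " ".join(batch)
    
    # %-formatting keeps the output identical on Python 2 and Python 3
    for item in timesplit(a):
        print("Found: %s" % item)
        print("Parsed: %s" % p.parse(item))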
    

    Yields:

    Found: 2011 04 23
    Parsed: 2011-04-23 00:00:00
    Found: 29 July 1928
    Parsed: 1928-07-29 00:00:00

    Update for Dieter

    Dateutil 2.1 appears to be written for compatibility with Python 3 and uses a compatibility library called six. Something isn't right with it and it's not treating str objects as text.

    This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:

    from cStringIO import StringIO
    for item in timesplit(StringIO(a)):
        print("Found: %s" % item)
        print("Parsed: %s" % p.parse(StringIO(item)))
    

    If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. For example:

    from dateutil.parser import _timelex, parser, parserinfo
    info = parserinfo(dayfirst=True)
    p = parser(info)
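
    To illustrate what such an option changes, here is a small sketch comparing a default parser with one built from parserinfo(dayfirst=True). The sample string and the variable names are only illustrative:

```python
from dateutil.parser import parser, parserinfo

default_parser = parser()
dayfirst_parser = parser(parserinfo(dayfirst=True))

# An ambiguous all-numeric date: either 2 January or 1 February.
ambiguous = "01/02/2003"

print(default_parser.parse(ambiguous))   # read month first: 2003-01-02 00:00:00
print(dayfirst_parser.parse(ambiguous))  # read day first:   2003-02-01 00:00:00
```

    The same parser object can then be used for p.parse(item) in the loop above.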
    
  • 2021-01-01 18:31

    Why not write a regex pattern covering all the possible forms in which a date can appear, and then run that regex over the text? I presume there are not dozens and dozens of ways to express a date in a string.

    The only problem is collecting as many date expressions as possible.
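
    A minimal sketch of the regex idea, assuming only two date shapes matter (ISO "YYYY-MM-DD" and forms like "29th of July, 1928"). The pattern, the MONTHS list and the helper name find_dates are illustrative assumptions, far from a complete catalogue of date formats:

```python
import re
from datetime import datetime

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Hypothetical pattern: covers only two of the many ways a date can be written.
DATE_RE = re.compile(
    r"(?P<iso>\b\d{4}-\d{2}-\d{2}\b)"
    r"|(?P<day>\b\d{1,2})(?:st|nd|rd|th)?\s+(?:of\s+)?"
    r"(?P<month>" + "|".join(MONTHS) + r"),?\s+(?P<year>\d{4})\b",
    re.IGNORECASE,
)

def find_dates(text):
    """Return a datetime for every substring that matches DATE_RE."""
    found = []
    for m in DATE_RE.finditer(text):
        try:
            if m.group("iso"):
                found.append(datetime.strptime(m.group("iso"), "%Y-%m-%d"))
            else:
                month = MONTHS.index(m.group("month").lower()) + 1
                found.append(datetime(int(m.group("year")), month,
                                      int(m.group("day"))))
        except ValueError:
            continue  # looked like a date but is not a real one, e.g. 2011-13-45
    return found

text = ("I like peas on 2011-04-23, and I also like them on easter "
        "and my birthday, the 29th of July, 1928")
print(find_dates(text))
```

    Every additional date style needs another alternative in the pattern, which is where the effort of this approach goes.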

  • 2021-01-01 18:35

    I think if you put the "words" in an array, it should do the trick. With that you can check whether each word is a date or not, and store the ones that are in a variable.

    Once you have the date, you should use the datetime library.
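
    A rough sketch of that word-by-word idea using only the standard library. The FORMATS tuple and the helper name dates_in_words are assumptions made for illustration:

```python
from datetime import datetime

# Assumed formats for single-word dates; extend the tuple as needed.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def dates_in_words(text):
    """Split text on whitespace and keep the words that parse as dates."""
    dates = []
    for word in text.split():
        word = word.strip(".,;:!?()")  # drop surrounding punctuation
        for fmt in FORMATS:
            try:
                dates.append(datetime.strptime(word, fmt))
                break  # stop at the first format that fits
            except ValueError:
                continue
    return dates

print(dates_in_words("I like peas on 2011-04-23, and on 29/07/1928."))
```

    Because each word is tested on its own, a date spread over several words, such as "29th of July, 1928", is not found this way.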

  • 2021-01-01 18:39

    While I was offline, I was bothered by the answer I posted here yesterday. Yes it did the job, but it was unnecessarily complicated and extremely inefficient.

    Here's the back-of-the-envelope edition that should do a much better job!

    import itertools
    from dateutil import parser
    
    jumpwords = set(parser.parserinfo.JUMP)
    keywords = set(kw.lower() for kw in itertools.chain(
        parser.parserinfo.UTCZONE,
        parser.parserinfo.PERTAIN,
        (x for s in parser.parserinfo.WEEKDAYS for x in s),
        (x for s in parser.parserinfo.MONTHS for x in s),
        (x for s in parser.parserinfo.HMS for x in s),
        (x for s in parser.parserinfo.AMPM for x in s),
    ))
    
    def parse_multiple(s):
        def is_valid_kw(s):
            try:  # is it a number?
                float(s)
                return True
            except ValueError:
                return s.lower() in keywords
    
        def _split(s):
            kw_found = False
            tokens = parser._timelex.split(s)
            for i in range(len(tokens)):
                if tokens[i] in jumpwords:
                    continue 
                if not kw_found and is_valid_kw(tokens[i]):
                    kw_found = True
                    start = i
                elif kw_found and not is_valid_kw(tokens[i]):
                    kw_found = False
                    yield "".join(tokens[start:i])
            # handle date at end of input str
            if kw_found:
                yield "".join(tokens[start:])
    
        return [parser.parse(x) for x in _split(s)]
    

    Example usage:

    >>> parse_multiple("I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928")
    [datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]
    

    It's probably worth noting that its behaviour deviates slightly from dateutil.parser.parse when dealing with empty/unknown strings. Dateutil will return the current day, while parse_multiple returns an empty list which, IMHO, is what one would expect.

    >>> from dateutil import parser
    >>> parser.parse("")
    datetime.datetime(2011, 8, 12, 0, 0)
    >>> parse_multiple("")
    []
    

    P.S. Just spotted MattH's updated answer which does something very similar.
