After using parsedatetime to get a time structure from the input string, how does one slice the rest of the string out?

后端未结

关注

 1  1622

爱一瞬间的悲伤 2021-01-24 04:28

I\'m wondering how to use parsedatetime for Python to return both the timestruct and the rest of the input string with just the date/time input removed.

Exa

1条回答

有刺的猬 (楼主)

2021-01-24 04:44

The only method of Calendar that returns that info is nlp() (which I suppose stands for Natural Language Processing). Here is a function returning all parts of the input:

import parsedatetime

calendar = parsedatetime.Calendar()

def parse(string, source_time = None):
    ret = []
    parsed_parts = calendar.nlp(string, source_time)
    if parsed_parts:
        last_stop = 0
        for part in parsed_parts:
            dt, status, start, stop, segment = part
            if start > last_stop:
                ret.append((None, 0, string[last_stop:start]))
            ret.append((dt, status, segment))
            last_stop = stop
        if len(string) > last_stop:
            ret.append((None, 0, string[last_stop:]))
    return ret

for s in ("Soccer with @homies at Payne Whitney tomorrow at 2 pm to 4 pm!",
          "Soccer with @homies at Payne Whitney tomorrow starting at 2 pm to 4 pm!",
          "Soccer with @homies at Payne Whitney tomorrow starting at 3 pm to 5 pm!"):
    print()
    print(s)
    result = parse(s)
    for part in result:
        print(part)

Output:

Soccer with @homies at Payne Whitney tomorrow at 2 pm to 4 pm!
(None, 0, 'Soccer with @homies at Payne Whitney ')
(datetime.datetime(2020, 1, 15, 16, 0), 3, 'tomorrow at 2 pm to 4 pm')
(None, 0, '!')

Soccer with @homies at Payne Whitney tomorrow starting at 2 pm to 4 pm!
(None, 0, 'Soccer with @homies at Payne Whitney ')
(datetime.datetime(2020, 1, 15, 9, 0), 1, 'tomorrow')
(None, 0, ' starting ')
(datetime.datetime(2020, 1, 14, 16, 0), 2, 'at 2 pm to 4 pm')
(None, 0, '!')

Soccer with @homies at Payne Whitney tomorrow starting at 3 pm to 5 pm!
(None, 0, 'Soccer with @homies at Payne Whitney ')
(datetime.datetime(2020, 1, 15, 9, 0), 1, 'tomorrow')
(None, 0, ' starting ')
(datetime.datetime(2020, 1, 14, 15, 0), 2, 'at 3 pm')
(None, 0, ' to ')
(datetime.datetime(2020, 1, 14, 17, 0), 2, '5 pm')
(None, 0, '!')

The status tells you whether the associated datetime is actually a date (1), a time (2), a datetime (3) or neither (0). In the first two cases, the missing fields are taken from the source_time, or from the current time if that is None.

But if you examine the output closely, you will see that there is a reliability problem here. Only the third parse can be used, in the other two cases information has been lost. Furthermore, I have no idea why the second and third string would be parsed differently.

An alternative library is dateparser. It looks more powerful, but has its own problems. The dateparser.parse.search_dates() function comes close to what you are interested in, but I haven't been able to find out how to tell whether a parsed datetime conveys only date information, only time information, or both. Anyway, here is a function that uses search_dates() to yield an output similar to the above, but without the status of each part:

from dateparser.search import search_dates

def parse(string: str):
    ret = []
    parsed_parts = search_dates(string)
    if parsed_parts:
        last_stop = 0
        for part in parsed_parts:
            segment, dt = part
            start = string.find(segment, last_stop)
            stop = start + len(segment)
            if start > last_stop:
                ret.append((None, string[last_stop:start]))
            ret.append((dt, segment))
            last_stop = stop
        if len(string) > last_stop:
            ret.append((None, string[last_stop:]))
    return ret


for s in ("Soccer with @homies at Payne Whitney tomorrow at 2 pm to 4 pm!",
          "Soccer with @homies at Payne Whitney tomorrow starting at 2 pm to 4 pm!",
          "Soccer with @homies at Payne Whitney tomorrow starting at 3 pm to 5 pm!"):
    print()
    print(s)
    result = parse(s)
    for part in result:
        print(part)

Output:

Soccer with @homies at Payne Whitney tomorrow at 2 pm to 4 pm!
(None, 'Soccer with @homies at Payne Whitney ')
(datetime.datetime(2020, 1, 15, 14, 0), 'tomorrow at 2 pm')
(None, ' to ')
(datetime.datetime(2020, 1, 13, 16, 0), '4 pm')
(None, '!')

Soccer with @homies at Payne Whitney tomorrow starting at 2 pm to 4 pm!
(None, 'Soccer with @homies at Payne Whitney ')
(datetime.datetime(2020, 1, 15, 0, 43, 0, 726130), 'tomorrow')
(None, ' starting ')
(datetime.datetime(2020, 1, 13, 14, 0), 'at 2 pm')
(None, ' to ')
(datetime.datetime(2020, 1, 13, 16, 0), '4 pm')
(None, '!')

Soccer with @homies at Payne Whitney tomorrow starting at 3 pm to 5 pm!
(None, 'Soccer with @homies at Payne Whitney ')
(datetime.datetime(2020, 1, 15, 0, 43, 0, 784468), 'tomorrow')
(None, ' starting ')
(datetime.datetime(2020, 1, 13, 15, 0), 'at 3 pm')
(None, ' to ')
(datetime.datetime(2020, 1, 13, 17, 0), '5 pm')
(None, '!')

I think that searching for the substring in the input is acceptable, and the parsing seems more predictable, but not knowing how to interpret each datetime is a problem.

0 讨论(0)