Best way to identify and extract dates from text in Python?

孤城傲影 2020-12-13 06:10

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.

For example, I have a large list o

7 Answers
  • 2020-12-13 06:56

    I was also looking for a solution to this and couldn't find any, so a friend and I built a tool to do it. I thought I would come back and share it in case others find it helpful.

    datefinder -- find and extract dates inside text

    Here's an example:

    import datefinder
    
    string_with_dates = '''
        Central design committee session Tuesday 10/22 6:30 pm
        Th 9/19 LAB: Serial encoding (Section 2.2)
        There will be another one on December 15th for those who are unable to make it today.
        Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
        He will be flying in Sept. 15th.
        We expect to deliver this between late 2021 and early 2022.
    '''
    
    matches = datefinder.find_dates(string_with_dates)
    for match in matches:
        print(match)
    
  • 2020-12-13 06:58

    Newer versions of the dateparser library provide search functionality.

    Example

    from dateparser.search import search_dates
    
    dates = search_dates('Central design committee session Tuesday 10/22 6:30 pm')
    
  • 2020-12-13 07:05
    import datefinder
    string_with_dates = """
                        entries are due by January 4th, 2017 at 8:00pm
                        created 01/15/2005 by ACME Inc. and associates.
                        """
    matches = datefinder.find_dates(string_with_dates)
    for match in matches:
        print(match)
    
  • 2020-12-13 07:09

    If you can identify the segments that actually contain the date information, parsing them can be fairly simple with parsedatetime. There are a few things to consider, though: your dates don't have years, and you should pick a locale.

    >>> import parsedatetime
    >>> p = parsedatetime.Calendar()
    >>> p.parse("December 15th")
    ((2013, 12, 15, 0, 13, 30, 4, 319, 0), 1)
    >>> p.parse("9/18 11:59 pm")
    ((2014, 9, 18, 23, 59, 0, 4, 319, 0), 3)
    >>> # It chooses 2014 since that's the *next* occurrence of 9/18
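
The "next occurrence" behaviour for a date without a year can be sketched with the standard library alone. This is only an illustration of the idea, not parsedatetime's actual implementation: `next_occurrence` is a hypothetical helper, and the reference date 2013-11-15 is an assumption inferred from the day-of-year value 319 in the output above.

```python
from datetime import date


def next_occurrence(month: int, day: int, reference: date) -> date:
    """Return the first date on or after `reference` with this month and day."""
    # Look up to 8 years ahead so that Feb 29 can land on the next leap year.
    for year in range(reference.year, reference.year + 9):
        try:
            candidate = date(year, month, day)
        except ValueError:
            continue  # e.g. Feb 29 in a non-leap year
        if candidate >= reference:
            return candidate
    raise ValueError(f"no valid date for month={month}, day={day}")


# Assumed reference date (hypothetical "today" for the session above)
reference = date(2013, 11, 15)
print(next_occurrence(12, 15, reference))  # 2013-12-15
print(next_occurrence(9, 18, reference))   # 2014-09-18
```

Passing the publication date of the text as `reference`, rather than today's date, usually gives a more sensible year for archived material.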
    

    It doesn't always work perfectly when you have extraneous text.

    >>> p.parse("9/19 LAB: Serial encoding")
    ((2014, 9, 19, 0, 15, 30, 4, 319, 0), 1)
    >>> p.parse("9/19 LAB: Serial encoding (Section 2.2)")
    ((2014, 2, 2, 0, 15, 32, 4, 319, 0), 1)
    

    Honestly, this seems like the kind of problem that would be simple enough to parse for particular formats and pick the most likely out of each sentence. Beyond that, it would be a decent machine learning problem.
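
As a rough sketch of what "parse for particular formats" could look like, the snippet below uses two regular expressions, one for numeric `M/D` or `M/D/YYYY` dates and one for month-name dates. The pattern names and `find_date_candidates` are hypothetical, the patterns assume US-style month/day order, and they cover only the formats seen in the examples on this page.

```python
import re

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

# Numeric form: 9/18, 10/22, 01/15/2005 (month/day order is an assumption)
NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})(?:/(\d{4}))?\b")

# Month-name form: "December 15th", "Sept. 15th", "January 4th, 2017"
NAMED = re.compile(
    r"\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
    r"\.?\s+(\d{1,2})(?:st|nd|rd|th)?(?:,?\s+(\d{4}))?\b",
    re.IGNORECASE,
)


def find_date_candidates(text):
    """Return (matched_text, month, day, year_or_None) tuples in text order."""
    found = []
    for m in NUMERIC.finditer(text):
        month, day = int(m.group(1)), int(m.group(2))
        year = int(m.group(3)) if m.group(3) else None
        if 1 <= month <= 12 and 1 <= day <= 31:
            found.append((m.start(), m.group(0), month, day, year))
    for m in NAMED.finditer(text):
        month = MONTHS.index(m.group(1)[:3].lower()) + 1
        day = int(m.group(2))
        year = int(m.group(3)) if m.group(3) else None
        if 1 <= day <= 31:
            found.append((m.start(), m.group(0), month, day, year))
    # Sort by position in the text, then drop the position
    return [item[1:] for item in sorted(found, key=lambda item: item[0])]


sample = """Th 9/19 LAB: Serial encoding (Section 2.2)
There will be another one on December 15th.
He will be flying in Sept. 15th.
created 01/15/2005 by ACME Inc."""

for candidate in find_date_candidates(sample):
    print(candidate)
```

Because the numeric pattern requires a slash, "Section 2.2" is not picked up as a date in this sketch.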

  • 2020-12-13 07:11

    Hi, I'm not sure whether the approach below counts as machine learning, but you may try it:

    • Add some context from outside the text, e.g. the publishing time of the message or post, or the current time (your text doesn't say anything about the year).
    • Extract all tokens using whitespace as the separator; you should get something like this:

      ['Th','Wednesday','9:34pm','7:34','pm','am','9/18','9/','/18', '19','12']
      
    • Process them with rule sets, e.g. rules consisting of weekday names and/or variations of the components that form a time, and mark the matches: '%d:%dpm', '%d am', '%d/%d', '%d/ %d' etc. may mean a date or time. Note that a match may be split across tokens, e.g. "12 / 31" is the 3-gram ('12','/','31') and should become the single token of interest "12/31".

    • "See" which tokens surround the marked tokens such as "9:45pm", e.g. ('Th','9/19','9:45pm') is a 3-gram formed from "interesting" tokens, and apply rules to it that may determine its meaning.

    • Do a more specific analysis where needed: for example, with 31/12, 31 > 12 means day/month (and vice versa), but with 12/12 the month/day order can only be determined from context built from the text and/or outside it.
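
A minimal sketch of the steps above, using only the standard library. All names here (`WEEKDAYS`, `tokenize`, `label`, `slash_order`) are hypothetical, and the rule set is deliberately tiny; a real one would need many more patterns.

```python
import re

# Hypothetical rule set: weekday words plus two token patterns
WEEKDAYS = {"mon", "monday", "tue", "tues", "tuesday", "wed", "wednesday",
            "th", "thu", "thur", "thurs", "thursday", "fri", "friday",
            "sat", "saturday", "sun", "sunday"}
TIME_RE = re.compile(r"^\d{1,2}:\d{2}(am|pm)?$", re.IGNORECASE)
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}$")


def tokenize(text):
    """Split on whitespace, then rejoin dates written as '12 / 31' into '12/31'."""
    raw = text.split()
    tokens = []
    i = 0
    while i < len(raw):
        if (i + 2 < len(raw) and raw[i].isdigit()
                and raw[i + 1] == "/" and raw[i + 2].isdigit()):
            tokens.append(raw[i] + "/" + raw[i + 2])
            i += 3
        else:
            tokens.append(raw[i])
            i += 1
    return tokens


def label(token):
    """Mark a token as 'weekday', 'time', 'date', or None if uninteresting."""
    word = token.strip(".,;:()").lower()
    if word in WEEKDAYS:
        return "weekday"
    if TIME_RE.match(word):
        return "time"
    if DATE_RE.match(word):
        return "date"
    return None


def slash_order(token):
    """Guess the field order of an 'a/b' date from the value ranges alone."""
    first, second = (int(part) for part in token.split("/"))
    if first > 12 and second <= 12:
        return "day/month"
    if second > 12 and first <= 12:
        return "month/day"
    return "ambiguous"  # e.g. 12/12: needs context from the text or outside it


tokens = tokenize("Th 9/19 9:45pm due 12 / 31")
print([(tok, label(tok)) for tok in tokens])
print(slash_order("31/12"), slash_order("9/19"), slash_order("12/12"))
```

Runs of adjacent labelled tokens, such as the weekday, date and time at the start of the example, are the n-grams that the context rules would then inspect.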

    Cheers

  • I am surprised that there is no mention of SUTime and dateparser's search_dates method.

    from sutime import SUTime
    import os
    import json
    from dateparser.search import search_dates
    
    str1 = "Let's meet sometime next Thursday" 
    
    # You'll get more information about these jar files from SUTime's github page
    jar_files = os.path.join(os.path.dirname(__file__), 'jars')
    sutime = SUTime(jars=jar_files, mark_time_ranges=True)
    
    print(json.dumps(sutime.parse(str1), sort_keys=True, indent=4))
    """output: 
    [
        {
            "end": 33,
            "start": 20,
            "text": "next Thursday",
            "type": "DATE",
            "value": "2018-10-11"
        }
    ]
    """
    
    print(search_dates(str1))
    #output:
    #[('Thursday', datetime.datetime(2018, 9, 27, 0, 0))]
    

    Although I have tried other modules like dateutil, datefinder and natty (I couldn't get duckling to work with Python), these two seem to give the most promising results.

    The results from SUTime are more reliable, as the code snippet above shows. However, SUTime fails in some basic scenarios, such as parsing the text

    "I won't be available until 9/19"

    or

    "I won't be available between (September 18-September 20)."

    It gives no result for the first text and only the month and year for the second. The search_dates method, however, handles both quite well. It is more aggressive and will return every possible date related to any word in the input text.

    I haven't yet found a way to make search_dates parse the text strictly for dates. If I find one, it will be my first choice over SUTime, and I will update this answer.
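
One possible workaround, sketched below, is to post-filter the list of `(matched_text, datetime)` tuples that search_dates returns and keep only matches whose text contains a digit or a date-like word. This is not a dateparser feature: `keep_strict` and `DATE_WORDS` are hypothetical names, the word list is an assumption, and the second tuple in the sample is made-up noise for illustration rather than real dateparser output.

```python
import re
from datetime import datetime

# Assumed heuristic: a digit, or a word starting with a month/weekday prefix
# or a common relative-day word, marks the match as a "real" date.
DATE_WORDS = re.compile(
    r"\d|\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|"
    r"mon|tue|wed|thu|fri|sat|sun|today|tomorrow|yesterday)",
    re.IGNORECASE,
)


def keep_strict(matches):
    """Keep (text, datetime) pairs whose text looks like an explicit date."""
    # `matches or []` also accepts None in place of an empty list
    return [(text, when) for text, when in (matches or [])
            if DATE_WORDS.search(text)]


sample = [
    ("Thursday", datetime(2018, 9, 27, 0, 0)),  # from the output shown above
    ("to", datetime(2018, 9, 20, 0, 0)),        # hypothetical noise match
]
print(keep_strict(sample))
```

The prefix check is loose (it would also accept a word like "maybe"), so the word list would need tightening for real use.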
