Regular Expression to match valid dates

庸人自扰 2020-11-22 04:48

I'm trying to write a regular expression that validates a date. The regex needs to match the following:

  • M/D/YYYY
  • MM/DD/YYYY
  • Single digit mon
15 answers
  •  攒了一身酷
    2020-11-22 05:09

    I landed here because the title of this question is broad and I was looking for a regex that I could use to match a specific date format (like the OP). But I then discovered, as many of the answers and comments have comprehensively highlighted, that there are many pitfalls that make constructing an effective pattern very tricky when extracting dates that are mixed in with poor-quality or unstructured source data.

    In my exploration of the issues, I have come up with a system that enables you to build a regular expression by arranging together four simpler sub-expressions that match on the delimiter, and valid ranges for the year, month and day fields in the order you require.

    These are:

    Delimiters

    [^\w\d\r\n:] 
    

    This will match anything that is not a word character, digit character, carriage return, new line or colon. (The \d is redundant, because \w already covers digits, and \w also excludes the underscore.) The colon has to be there to prevent matching on times that look like dates (see my test data).

    You can optimise this part of the pattern to speed up matching, but this is a good foundation that detects most valid delimiters.

    Note, however, that it will match a string with mixed delimiters like 2/12-73, which may not actually be a valid date.
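
    As a rough illustration, assuming Python's re module (the question does not name a language), the delimiter class can be probed one character at a time. The helper name is_delimiter and the stricter SAME_DELIM pattern are hypothetical additions, not part of the answer: SAME_DELIM swaps in a backreference so that both delimiters must be the same character.

```python
import re

# Delimiter class copied from the text above.
DELIM = re.compile(r"[^\w\d\r\n:]")

def is_delimiter(ch):
    """Return True if a single character is accepted as a date delimiter."""
    return DELIM.fullmatch(ch) is not None

# Hypothetical stricter variant (not from the answer): capture the first
# delimiter and require the second one to be identical via \1.
SAME_DELIM = re.compile(r"\b\d{1,2}([^\w\r\n:])\d{1,2}\1(?:\d{4}|\d{2})\b")

# Slash, dash, dot, space and backslash are accepted; colon, underscore
# and digits are rejected.
for ch in ["/", "-", ".", " ", "\\", ":", "_", "7"]:
    print(repr(ch), is_delimiter(ch))

# The mixed-delimiter string is rejected by the backreference variant.
for text in ["2/12/73", "2/12-73"]:
    print(text, SAME_DELIM.search(text) is not None)
```

    Note that SAME_DELIM does no range checking on the day and month fields; it only shows the backreference idea.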

    Year Values

    (\d{4}|\d{2})
    

    This matches a group of two or four digits. In most cases this is acceptable, but if you're dealing with data from the years 0-999 or beyond 9999 you need to decide how to handle that, because in most cases a 1, 3 or >4 digit year is garbage.

    Month Values

    (0?[1-9]|1[0-2])
    

    Matches any number between 1 and 12 with or without a leading zero. Note: 0 and 00 are not matched.

    Day Values

    (0?[1-9]|[12]\d|30|31)
    

    Matches any number between 1 and 31 with or without a leading zero. Note: 0 and 00 are not matched.
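
    A quick way to sanity-check the month and day sub-expressions in isolation is to run them with a full-string match. This is only a sketch, assuming Python's re module; the helper name accepts is hypothetical.

```python
import re

# Sub-expressions copied from the text above.
MONTH = re.compile(r"(0?[1-9]|1[0-2])")
DAY = re.compile(r"(0?[1-9]|[12]\d|30|31)")

def accepts(pattern, text):
    """Return True only if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None

month_candidates = ["0", "00", "1", "01", "9", "10", "12", "13"]
day_candidates = ["0", "00", "1", "09", "29", "30", "31", "32"]

months_ok = [s for s in month_candidates if accepts(MONTH, s)]
days_ok = [s for s in day_candidates if accepts(DAY, s)]

print(months_ok)  # ['1', '01', '9', '10', '12']
print(days_ok)    # ['1', '09', '29', '30', '31']
```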

    This expression matches dates in day, month, year order:

    (0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})
    

    But it will also match some of the year, month, day ones. It should also be bookended with word-boundary anchors (\b) to ensure the whole date string is selected and to prevent valid sub-dates being extracted from data that is not well-formed. For example, without the boundary anchors 20/12/194 matches as 20/12/19, and 101/12/1974 matches as 01/12/1974.

    Compare the results of the next expression with those of the one above, using the test data in the nonsense section (below).

    \b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b
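
    The effect of the boundary anchors can be seen by assembling both variants from the four sub-expressions and running them over the same inputs. A minimal sketch, assuming Python's re module; the names DMY_LOOSE, DMY_BOUNDED and first_match are made up for the illustration.

```python
import re

# The four building blocks from the text above.
DELIM = r"[^\w\d\r\n:]"
YEAR = r"(\d{4}|\d{2})"
MONTH = r"(0?[1-9]|1[0-2])"
DAY = r"(0?[1-9]|[12]\d|30|31)"

# Day, month, year order, without and with \b bookends.
DMY_LOOSE = re.compile(DAY + DELIM + MONTH + DELIM + YEAR)
DMY_BOUNDED = re.compile(r"\b" + DAY + DELIM + MONTH + DELIM + YEAR + r"\b")

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ["31/1/1973", "20/12/194", "101/12/1974"]:
    print(text, first_match(DMY_LOOSE, text), first_match(DMY_BOUNDED, text))
```

    For the two malformed inputs the loose pattern extracts a sub-date, while the bounded pattern returns nothing.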
    

    There's no validation in this regex so a well-formed but invalid date such as 31/02/2001 would be matched. That is a data quality issue, and as others have said, your regex shouldn't need to validate the data.

    Because you (as a developer) can't guarantee the quality of the source data, you do need to perform additional validation in your code and handle the failures. If you try to both match and validate the data in the regex, it gets very messy and becomes difficult to support without very precise documentation.

    Garbage in, garbage out.
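
    One possible shape for that extra validation step, sketched in Python: let the regex find a well-formed candidate, then hand the fields to datetime.date, which rejects impossible calendar dates. The function name extract_valid_dmy and the century parameter (used to expand two-digit years) are assumptions made for this example; how two-digit years should really be expanded depends on your data.

```python
import re
from datetime import date

# Bounded day/month/year pattern from the text above.
DMY = re.compile(
    r"\b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b"
)

def extract_valid_dmy(text, century=1900):
    """Return the first DMY date in text as a date, or None if absent or invalid.

    century is a hypothetical knob: it is added to two-digit years only.
    """
    m = DMY.search(text)
    if m is None:
        return None
    day_text, month_text, year_text = m.groups()
    year = int(year_text)
    if len(year_text) == 2:
        year += century
    try:
        # date() raises ValueError for dates such as 31 February.
        return date(year, int(month_text), int(day_text))
    except ValueError:
        return None

print(extract_valid_dmy("31/02/2001"))  # None: well-formed but not a real date
print(extract_valid_dmy("29/02/1976"))  # 1976-02-29
print(extract_valid_dmy("2/11/73"))     # 1973-11-02
```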

    Having said that, if you do have mixed formats where the order of the date fields varies, and you have to extract as much as you can, you can combine a couple of expressions together like so:

    This (disastrous) expression matches DMY and MDY dates

    (\b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b)|(\b(0?[1-9]|1[0-2])[^\w\d\r\n:](0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](\d{4}|\d{2})\b)
    

    But you won't be able to tell whether dates like 6/9/1973 are the 6th of September or the 9th of June. I'm struggling to think of a scenario where that is not going to cause a problem somewhere down the line. It's bad practice and you shouldn't have to deal with it like that: find the data owner and hit them with the governance hammer.
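
    The ambiguity is easy to demonstrate by parsing the same string under both orders. This sketch assumes Python's datetime.strptime and slash-delimited input with a four-digit year; the function name interpretations is hypothetical.

```python
from datetime import datetime

def interpretations(text):
    """Return every reading of text that parses, keyed by field order."""
    results = {}
    for label, fmt in (("DMY", "%d/%m/%Y"), ("MDY", "%m/%d/%Y")):
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # This field order does not produce a real date.
            pass
    return results

print(interpretations("6/9/1973"))   # two different dates: ambiguous
print(interpretations("31/1/1973"))  # only the DMY reading survives
```

    When both keys come back with different dates, nothing in the string itself can settle which one was meant.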

    Finally, if you want to match a YYYYMMDD string with no delimiters you can take some of the uncertainty out and the expression looks like this

    \b(\d{4})(0[1-9]|1[0-2])(0[1-9]|[12]\d|30|31)\b
    

    But note again, it will match on well-formed but invalid values like 20010231 (31st Feb!) :)
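
    A small sketch of the same idea for the compact format, assuming Python: the regex picks out a well-formed YYYYMMDD candidate and datetime.date decides whether it is a real calendar date. The name parse_compact is made up for the example.

```python
import re
from datetime import date

# YYYYMMDD pattern from the text above.
YMD_COMPACT = re.compile(r"\b(\d{4})(0[1-9]|1[0-2])(0[1-9]|[12]\d|30|31)\b")

def parse_compact(text):
    """Return the first YYYYMMDD date in text, or None if absent or invalid."""
    m = YMD_COMPACT.search(text)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Well-formed but impossible, e.g. 31 February.
        return None

print(parse_compact("19741213"))  # 1974-12-13
print(parse_compact("20010231"))  # None, although the regex itself matches
```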

    Test data

    In experimenting with the solutions in this thread I ended up with a test data set that includes a variety of valid and invalid dates, plus some tricky situations where you may or may not want a match, e.g. times that could match as dates, and dates that run across multiple lines.

    I hope this is useful to someone.

    Valid Dates in various formats
    
    Day, month, year
    2/11/73
    02/11/1973
    2/1/73
    02/01/73
    31/1/1973
    02/1/1973
    31.1.2011
    31-1-2001
    29/2/1973
    29/02/1976 
    03/06/2010
    12/6/90
    
    month, day, year
    02/24/1975 
    06/19/66 
    03.31.1991
    2.29.2003
    02-29-55
    03-13-55
    03-13-1955
    12\24\1974
    12\30\1974
    1\31\1974
    03/31/2001
    01/21/2001
    12/13/2001
    
    Match both DMY and MDY
    12/12/1978
    6/6/78
    06/6/1978
    6/06/1978
    
    using whitespace as a delimiter
    
    13 11 2001
    11 13 2001
    11 13 01 
    13 11 01
    1 1 01
    1 1 2001
    
    Year Month Day order
    76/02/02
    1976/02/29
    1976/2/13
    76/09/31
    
    YYYYMMDD sortable format
    19741213
    19750101
    
    Valid dates before Epoch
    12/1/10
    12/01/660
    12/01/00
    12/01/0000
    
    Valid date after 2038
    
    01/01/2039
    01/01/39
    
    Valid date beyond the year 9999
    
    01/01/10000
    
    Dates with leading or trailing characters
    
    12/31/21/
    31/12/1921AD
    31/12/1921.10:55
    12/10/2016  8:26:00.39
    wfuwdf12/11/74iuhwf
    fwefew13/11/1974
    01/12/1974vdwdfwe
    01/01/99werwer
    12321301/01/99
    
    Times that look like dates
    
    12:13:56
    13:12:01
    1:12:01PM
    1:12:01 AM
    
    Dates that run across two lines
    
    1/12/19
    74
    
    01/12/19
    74/13/1946
    
    31/12/20
    08:13
    
    Invalid, corrupted or nonsense dates
    
    0/1/2001
    1/0/2001
    00/01/2100
    01/0/2001
    0101/2001
    01/131/2001
    31/31/2001
    101/12/1974
    56/56/56
    00/00/0000
    0/0/1999
    12/01/0
    12/10/-100
    74/2/29
    12/32/45
    20/12/194
    
    2/12-73
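
    To try a pattern against this data, a small harness along these lines may help. It is a sketch in Python using the bounded day, month, year expression from earlier and a handful of the lines above; the names SAMPLES and scan are hypothetical, and the full list can be pasted in the same way.

```python
import re

# Bounded day/month/year pattern from the answer.
DMY_BOUNDED = re.compile(
    r"\b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b"
)

# A few lines taken from the test data above.
SAMPLES = [
    "02/11/1973",
    "31.1.2011",
    "13 11 2001",
    "12/10/2016  8:26:00.39",
    "31/12/1921AD",
    "12:13:56",
    "0/1/2001",
    "101/12/1974",
    "20/12/194",
    "2/12-73",
]

def scan(lines):
    """Map each input line to the substring the pattern extracts, or None."""
    results = {}
    for line in lines:
        m = DMY_BOUNDED.search(line)
        results[line] = m.group(0) if m else None
    return results

for line, found in scan(SAMPLES).items():
    print(f"{line!r:28} -> {found!r}")
```

    With these samples the times and the malformed entries come back as None, while the mixed-delimiter string 2/12-73 is still extracted.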
    
