Recognise an arbitrary date string

抹茶落季 2020-12-01 16:18

I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and date (e.g. 12/12/10), I just need to classify the string as being

14 answers
  • 2020-12-01 16:36

    Usually dates are characters separated by a back/forward slash or a dash. Did you consider a regular expression?

    I am assuming you are not looking to classify dates such as "Sunday, October 3rd 2010" and so on.
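    As a rough illustration of the regular-expression idea, here is a minimal Python sketch. The pattern and the name `looks_like_numeric_date` are assumptions made for this example: it only accepts digit groups joined by the same slash, backslash or dash, and it does not check that the numbers form a real calendar date.

```python
import re

# Assumed shape: 1-4 digits, a separator, 1-2 digits, the SAME separator, 1-4 digits.
# The lookarounds stop the pattern from matching inside a longer run of digits.
NUMERIC_DATE = re.compile(r"(?<!\d)\d{1,4}([/\\-])\d{1,2}\1\d{1,4}(?!\d)")

def looks_like_numeric_date(text: str) -> bool:
    """Return True if text contains something shaped like 12/12/10 or 2010-12-01."""
    return NUMERIC_DATE.search(text) is not None
```

    Written-out dates like the one mentioned above are deliberately out of scope for this pattern.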

  • 2020-12-01 16:38

    Rules that might help you in your quest:

    1. Make or find some sort of database of known words that match months, both abbreviated and full names, like Jan or January. The search must be case insensitive, because fEBruaRy is also a month, although the person typing it must have been drunk. If you plan to search for non-English months, a database is also needed, because no heuristic will find out that "Wrzesień" is Polish for September.
    2. For English only, check for ordinal numbers and also make a database of the numbers 1 to 31. These will be useful for days and months. If you want to use this approach for other languages, you will have to do your own research.
    3. Once again, English only: check for "Anno Domini" and "Before Christ", that is, AD and BC respectively. They can also appear in the forms A.D. and B.C.
    4. Concerning numbers themselves that will represent days, months and years, you must know where your limit is. Is it 0-9999, or more? That is, do you want to search for dates that represent years beyond year 9999? If no, then strings that have 1-4 consecutive digits are good guesses for a valid day, month or year.
    5. Days and months have one or two digits. Leading zeros are acceptable, so strings of the form 0*, where * is a digit from 1 to 9, are acceptable.
    6. Separators can be tricky, but if you don't allow inconsistent formatting like 10/20\1999, then you will save yourself a lot of grief. This is because 10*20*1999 can be a valid date, with * usually being one element of set {-,_, ,:,/,\,.,','}, but it's possible that * is a combination of 2 or 3 elements of mentioned set. Once again, you must choose acceptable separators. 10?20?1999 can be a valid date for somebody with a weird sense of elegance. 10 / 20 / 1999 can also be a valid date, but 10_/20_/1999 would be a very strange one.
    7. There are cases with no separator. For example: 10Jan1988. These cases use words from 1.
    8. There are special cases, like February 28th or 29th, depending on leap year. Also, months with 30 or 31 days.
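    A few of these rules (1, 6 and 7) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `MONTHS` table is a tiny English-only stand-in for the database from rule 1, the separator set is one possible choice, and `classify_shape` is a hypothetical helper name.

```python
import re

# Rule 1: a tiny English-only "database" of month words (an assumption for this sketch).
MONTHS = {
    "jan": 1, "january": 1, "feb": 2, "february": 2, "mar": 3, "march": 3,
    "apr": 4, "april": 4, "may": 5, "jun": 6, "june": 6, "jul": 7, "july": 7,
    "aug": 8, "august": 8, "sep": 9, "sept": 9, "september": 9,
    "oct": 10, "october": 10, "nov": 11, "november": 11,
    "dec": 12, "december": 12,
}

def month_number(word):
    """Case-insensitive month lookup; returns None for unknown words."""
    return MONTHS.get(word.lower())

# Rule 6: one separator from an assumed set, optionally padded with spaces,
# and the same separator must be used both times (backreference \2).
SEPARATED = re.compile(
    r"^(\d{1,4})\s*([-_:/\\.,' ])\s*(\d{1,2})\s*\2\s*(\d{1,4})$"
)

# Rule 7: no separator at all, e.g. 10Jan1988.
COMPACT = re.compile(r"^(\d{1,2})([A-Za-z]+)(\d{1,4})$")

def classify_shape(s):
    """Return 'numeric', 'compact' or None. Only the shape is checked, not the calendar."""
    s = s.strip()
    if SEPARATED.match(s):
        return "numeric"
    m = COMPACT.match(s)
    if m and month_number(m.group(2)) is not None:
        return "compact"
    return None
```

    With these assumptions, 10/20/1999 and 10 / 20 / 1999 are accepted, while the inconsistent 10/20\1999 and the strange 10_/20_/1999 are rejected.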

    I think these are enough for a "naive" classification, a linguist expert might help you more.

    Now, an idea for your algorithm. Speed doesn't matter at first: there may be multiple passes over the same string, so optimize only when it starts to matter. When you suspect that you have found a date string, store it somewhere "safe" in a ListOfPossibleDates and examine it again with stricter rules, using combinations of rules 1. to 8. When you believe a date string is valid, feed it to the Date class to see if it really is. For example, 32nd March 1999 will be rejected once you convert it to a format that Date understands.
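    The final validation step could look like this in Python, where the standard library's `datetime.date` stands in for the Date class mentioned above (that substitution, and the helper name `is_real_date`, are assumptions of this sketch).

```python
from datetime import date

def is_real_date(year, month, day):
    """Let the date type decide whether the candidate exists on the calendar."""
    try:
        # date() raises ValueError for impossible combinations such as
        # 32 March, or 29 February in a non-leap year. It only supports
        # years 1 to 9999, so year 0 and BC dates are rejected as well.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

    This also covers rule 8 (leap years and month lengths) without any hand-written calendar logic.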

    One important recurring pattern is lookbehind and lookahead. When you believe a valid entity (day, month, year) has been found, you'll have to see what lies before and after it. A stack-based mechanism or recursion might help here.

    Steps:

    1. Search your string for words from rule 1. If you find any of them, note the location and the month. Now go a few characters back and a few ahead to see what awaits you. If there are no spaces before and after your month, and there are numbers, as in rule 7., check them for validity. If one of them represents a day (must be 0-31) and the other a year (must be 0-9999, possibly with AD or BC), you have one candidate. If there are the same separators before and after, apply the rules from 6. Always remember that you must be sure that a valid combination exists. So 32Jan1999 won't do.
    2. Search your string for other English words, from rules 2. and 3., and proceed as in step 1.
    3. Search for separators. Empty space will be the trickiest. Try to find them in pairs. So, if you have one "/" in your string, find another one and see what they have in between. If you find a combination of separators, do the same thing. Also, use the algorithm from step 2.
    4. Search for digits. Valid ones are 0-9999 with leading zeroes allowed. If you find one, look for separators like in step 3.
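    Step 1 might be sketched like this in Python. It is a simplified illustration, not a complete implementation: only English three-letter month prefixes are searched, at most one space is tolerated on each side, and the function name `candidates_around_months` is made up for this example.

```python
import re

MONTH_ABBREVS = ["jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_RE = re.compile("|".join(MONTH_ABBREVS), re.IGNORECASE)

def candidates_around_months(text):
    """Find month words, then look just before and after them for a day and a year."""
    found = []
    for m in MONTH_RE.finditer(text):
        # Look behind: one or two digits right before the month word.
        before = re.search(r"(\d{1,2})\s?$", text[:m.start()])
        # Look ahead: skip the rest of a full month name, then up to four digits.
        after = re.match(r"[a-z]*\s?(\d{1,4})", text[m.end():], re.IGNORECASE)
        if not (before and after):
            continue
        day, year = int(before.group(1)), int(after.group(1))
        month = MONTH_ABBREVS.index(m.group().lower()) + 1
        # This sketch requires 1-31 for the day, so 32Jan1999 is dropped.
        if 1 <= day <= 31:
            found.append((day, month, year))
    return found
```

    Each returned tuple is only a candidate; it would still need the stricter checks described above before being accepted as a date.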

    Since there are countless possibilities, you won't be able to catch them all. Once you have found a pattern that you believe could occur again, store it somewhere so you can use it as a regex for parsing other strings.

    Let's take your example, "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla". After you extract the first date, 12 Jan 09, then use the rest of that string ("bla bla bla 01/04/10 bla bla bla") and apply all above steps once again. This way you'll be sure you didn't miss anything.
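    The "extract, then continue with the rest of the string" loop could be sketched as follows in Python. The `CANDIDATE` pattern is an assumption that covers only the two shapes in this example (day, month word, year; and three digit groups with a repeated slash or dash).

```python
import re

# Assumed candidate shapes: "12 Jan 09"-style and "01/04/10"-style.
CANDIDATE = re.compile(
    r"\d{1,2}\s?(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\s?\d{1,4}"
    r"|\d{1,4}([/-])\d{1,2}\1\d{1,4}",
    re.IGNORECASE,
)

def extract_all(text):
    """Extract one candidate, then rescan whatever follows it, until nothing is left."""
    found = []
    rest = text
    while True:
        m = CANDIDATE.search(rest)
        if m is None:
            break
        found.append(m.group())
        rest = rest[m.end():]  # continue with the remainder of the string
    return found

print(extract_all("bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"))
# prints ['12 Jan 09', '01/04/10']
```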

    I hope these suggestions will be of at least some help. If no library exists that does all these dirty steps (and more) for you, then you have a tough road ahead of you. Good luck!

  • 2020-12-01 16:40

    I don't know of any library that does this either. I would suggest a mix of nested recursive functions and regular expressions (a lot) to match strings and try to come up with a best guess to see if it can be a date. Dates can be written in a lot of different ways, some people might write them out as "Sunday, October 3 2010" or "Sunday, October 3rd 2010" or "10/03/2010" or "10/3/2010" and a whole bunch of different ways (even more if you are considering dates in other languages/cultures).

  • 2020-12-01 16:44

    I am sure researchers in information extraction have looked at this problem, but I couldn't find a paper.

    One thing you can try is a three-step process. (1) After collecting as much data as you can, extract features. Some features that come to mind: the number of numbers that appear in the string, the number of numbers from 1-31, the number of numbers from 1-12, the number of month names, and so on. (2) Learn from the features using some type of binary classification method (an SVM, for example). (3) When a new string comes by, extract its features and query the SVM for a prediction.
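    The feature-extraction part could be sketched like this in Python. The month list is an English-only assumption and `extract_features` is a hypothetical name; training the classifier itself is not shown, since it depends on whichever machine-learning library you pick.

```python
import re

# English month names and common abbreviations (an assumption for this sketch).
MONTH_NAMES = {
    "jan", "january", "feb", "february", "mar", "march", "apr", "april",
    "may", "jun", "june", "jul", "july", "aug", "august",
    "sep", "sept", "september", "oct", "october",
    "nov", "november", "dec", "december",
}

def extract_features(text):
    """Turn a string into the four numeric features listed above."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [
        len(numbers),                               # how many numbers appear
        sum(1 for n in numbers if 1 <= n <= 31),    # plausible day values
        sum(1 for n in numbers if 1 <= n <= 12),    # plausible month values
        sum(1 for w in words if w in MONTH_NAMES),  # month names
    ]
```

    The resulting vectors, together with labels saying whether each string is a date, would be the training input for the binary classifier.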

  • 2020-12-01 16:44

    It is virtually impossible to recognize all possible date formats as dates using "standard" algorithms. That's just because there are so many of them.

    We humans can do it only because we have learned that something like 2010-03-31 resembles a date. In other words, I would suggest using machine learning algorithms to teach your program to recognize valid date sequences. With the Google Prediction API that should be feasible.

    Or you can use Regular Expressions as suggested above, to detect some but not all date formats.

  • 2020-12-01 16:46

    Use JChronic

    You may want to use DateParser2 from edu.mit.broad.genome.utils package.
