I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and day (e.g. 12/12/10); I just need to classify the string as being a date.
Usually dates are characters separated by a back/forward slash or a dash. Did you consider a regular expression?
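As a rough illustration of the regular-expression idea, here is a minimal sketch in Java. The class name NumericDateShape and the method looksLikeDate are made up for this example, and the pattern only covers purely numeric dates with a slash, backslash or dash as separator; it checks the shape of the string, not whether the date exists.

```java
import java.util.regex.Pattern;

class NumericDateShape {
    // One or two digits, a separator (/, \ or -), one or two digits,
    // the same separator again (backreference \2), then a 4- or 2-digit year.
    private static final Pattern SHAPE =
        Pattern.compile("(\\d{1,2})([/\\\\-])(\\d{1,2})\\2(\\d{4}|\\d{2})");

    // True if the whole (trimmed) string has the shape of a numeric date.
    static boolean looksLikeDate(String s) {
        return SHAPE.matcher(s.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("12/12/10"));
        System.out.println(looksLikeDate("01-04-2010"));
        System.out.println(looksLikeDate("12/12-10")); // mixed separators are rejected
    }
}
```

Requiring the same separator twice is a design choice for this sketch; drop the backreference if mixed separators should be accepted.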
I am assuming you are not looking to classify dates of the type "Sunday, October 3rd 2010" and so on.
Rules that might help you in your quest:
Month names can appear abbreviated (Jan) or in full (January). The search must be case-insensitive, because fEBruaRy is also a month, although the person typing it must have been drunk. If you plan to search for non-English months, a database is also needed, because no heuristic will find out that "Wrzesień" is Polish for September.
Numbers with a leading zero of the form 0*, where * can be 1-9, are acceptable.
Separators can come from the set {-,_, ,:,/,\,.,','}, but it's possible that a separator is a combination of 2 or 3 elements of the mentioned set. Once again, you must choose acceptable separators. 10?20?1999 can be a valid date for somebody with a weird sense of elegance. 10 / 20 / 1999 can also be a valid date, but 10_/20_/1999 would be a very strange one.
I think these are enough for a "naive" classification; a linguist expert might help you more.
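The rules above could be combined into a single pattern along the following lines. This is only a sketch under several assumptions: English month names only, a day-month-year order, and separators made of one to three characters from the set mentioned. The class NaiveDateRules and its method are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class NaiveDateRules {
    // English month names, abbreviated or in full (assumption: English only).
    private static final String MONTH =
        "(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
        + "|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
    // Day of month 1-31, with an optional leading zero for 1-9.
    private static final String DAY = "(?:0?[1-9]|[12]\\d|3[01])";
    // Four-digit or two-digit year.
    private static final String YEAR = "(?:\\d{4}|\\d{2})";
    // One to three characters from the chosen separator set.
    private static final String SEP = "[-_ :/\\\\.,']{1,3}";

    private static final Pattern DAY_MONTH_YEAR = Pattern.compile(
        DAY + SEP + MONTH + SEP + YEAR, Pattern.CASE_INSENSITIVE);

    // True if the whole string reads as day, month name, year.
    static boolean matchesDayMonthYear(String s) {
        return DAY_MONTH_YEAR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDayMonthYear("12 Jan 09"));
        System.out.println(matchesDayMonthYear("03 / fEBruaRy / 1999"));
        System.out.println(matchesDayMonthYear("32 March 1999")); // day out of range
    }
}
```

Other orders (month first, year first) would need their own patterns; non-English month names would have to come from a lookup table rather than from this alternation.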
Now, an idea for your algorithm. Speed doesn't matter. There might be multiple passes over the same string. Optimize when it starts to matter. When you suspect that you have found a date string, store it somewhere "safe" in a ListOfPossibleDates and examine it once again with more rigid rules, using combinations of rules 1 to 8. When you believe a date string is valid, feed it to the Date class to see if it's really valid. 32nd March 1999 is not valid, which you will discover when you convert it to a format that Date will understand.
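The final validation step might look like the sketch below. Note the substitution: instead of the Date class mentioned above, it uses java.time.LocalDate with a strict resolver, which rejects impossible dates instead of rolling them over. The class CandidateValidator, its methods, and the assumption that candidates have already been normalized to the form "d MMMM uuuu" (e.g. "32 March 1999") are illustrative, not part of the original suggestion.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CandidateValidator {
    // Strict resolving: day 32, or 29 February in a non-leap year, is an error.
    private static final DateTimeFormatter STRICT_DMY = new DateTimeFormatterBuilder()
        .parseCaseInsensitive()
        .appendPattern("d MMMM uuuu")
        .toFormatter(Locale.ENGLISH)
        .withResolverStyle(ResolverStyle.STRICT);

    // True if the normalized candidate is a date that really exists.
    static boolean isRealDate(String normalized) {
        try {
            LocalDate.parse(normalized, STRICT_DMY);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Second pass over the "safe" list: keep only candidates that survive validation.
    static List<String> confirm(List<String> listOfPossibleDates) {
        List<String> confirmed = new ArrayList<>();
        for (String candidate : listOfPossibleDates) {
            if (isRealDate(candidate)) {
                confirmed.add(candidate);
            }
        }
        return confirmed;
    }
}
```

With the older API, a SimpleDateFormat with setLenient(false) plays a similar role; the lenient default would silently turn 32 March into 1 April.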
One important recurring pattern is lookaround (lookbehind and lookahead). When you believe a valid entity (day, month, year) has been found, you'll have to see what lies before and after it. A stack-based mechanism or recursion might help here.
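One way to check what surrounds a candidate, if you stay with regular expressions, is to use regex lookbehind and lookahead rather than a hand-written stack. The sketch below assumes a dd/mm/yy shape and simply refuses candidates that are glued to further digits; LookaroundExample and findIsolated are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundExample {
    // (?<!\d) : the candidate must not be preceded by a digit.
    // (?!\d)  : the candidate must not be followed by a digit.
    private static final Pattern ISOLATED =
        Pattern.compile("(?<!\\d)\\d{1,2}/\\d{1,2}/\\d{2}(?!\\d)");

    // Returns every dd/mm/yy candidate that is not part of a longer digit run.
    static List<String> findIsolated(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ISOLATED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findIsolated("on 12/12/10."));  // candidate stands alone
        System.out.println(findIsolated("ref 112/12/10")); // glued to a leading digit
        System.out.println(findIsolated("12/12/2010"));    // year is not cut to "20"
    }
}
```

Without the two assertions, the pattern would happily report "12/12/10" inside "112/12/10" and "12/12/20" inside "12/12/2010".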
Steps:
Since there are countless possibilities, you won't be able to catch them all. Once you have found a pattern that you believe could occur again, store it somewhere so you can use it as a regex for parsing other strings.
Let's take your example, "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla". After you extract the first date, 12 Jan 09, take the rest of that string ("bla bla bla 01/04/10 bla bla bla") and apply all the above steps once again. This way you'll be sure you didn't miss anything.
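The "extract one date, then continue with the rest" loop can be sketched as follows. The candidate pattern is deliberately small (numeric dates with / or -, and "day month-name year" with single spaces) and is an assumption of this example, as are the names DateExtractor and extractAll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractor {
    // Either a numeric date such as 01/04/10, or day + month name + year such as 12 Jan 09.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}\\b"
        + "|\\b\\d{1,2} (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* \\d{2,4}\\b",
        Pattern.CASE_INSENSITIVE);

    // Take the first candidate, then repeat the search on the remainder of the string.
    static List<String> extractAll(String text) {
        List<String> found = new ArrayList<>();
        String rest = text;
        while (true) {
            Matcher m = CANDIDATE.matcher(rest);
            if (!m.find()) {
                break;
            }
            found.add(m.group());
            rest = rest.substring(m.end()); // continue after the extracted date
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extractAll(
            "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"));
    }
}
```

Calling find() repeatedly on a single Matcher would do the same job without the substring copies; the explicit remainder is kept here only to mirror the description.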
I hope these suggestions will be of at least some help. If no library exists that does all these dirty steps (and more) for you, then you have a tough road ahead of you. Good luck!
I don't know of any library that does this either. I would suggest a mix of nested recursive functions and regular expressions (a lot) to match strings and try to come up with a best guess to see if it can be a date. Dates can be written in a lot of different ways, some people might write them out as "Sunday, October 3 2010" or "Sunday, October 3rd 2010" or "10/03/2010" or "10/3/2010" and a whole bunch of different ways (even more if you are considering dates in other languages/cultures).
I am sure researchers in information extraction have looked at this problem, but I couldn't find a paper.
One thing you can try is a three-step process: (1) after collecting as much data as you can, extract features. Some features that come to mind: the number of numbers that appear in the string, the number of numbers from 1-31, the number of numbers from 1-12, the number of month names, and so on. (2) Learn from the features using some type of binary classification method (an SVM, for example). (3) When a new string comes by, extract the features and query the SVM for a prediction.
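The feature-extraction step could be sketched like this. It computes only the four features listed above and leaves training and prediction to whichever classifier library you pick; the class DateFeatures, the English-only month list and the order of the returned array are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateFeatures {
    private static final Pattern NUMBER = Pattern.compile("\\d+");
    // English month names as whole words (assumption: English only).
    private static final Pattern MONTH = Pattern.compile(
        "\\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
        + "|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns {numbers, numbers in 1-31, numbers in 1-12, month names}.
    static int[] extract(String s) {
        int numbers = 0, upTo31 = 0, upTo12 = 0;
        Matcher n = NUMBER.matcher(s);
        while (n.find()) {
            numbers++;
            // Digit runs longer than two characters are not treated as day or
            // month numbers; skipping them also avoids overflow in parseInt.
            if (n.group().length() <= 2) {
                int v = Integer.parseInt(n.group());
                if (v >= 1 && v <= 31) upTo31++;
                if (v >= 1 && v <= 12) upTo12++;
            }
        }
        int months = 0;
        Matcher m = MONTH.matcher(s);
        while (m.find()) {
            months++;
        }
        return new int[] {numbers, upTo31, upTo12, months};
    }
}
```

For example, "12 Jan 09" yields {2, 2, 2, 1} and "Sunday, October 3rd 2010" yields {2, 1, 1, 1}; vectors like these, labelled date or not-date, would form the training set.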
It is virtually impossible to recognize all possible date formats as dates using "standard" algorithms. That's just because there are so many of them.
We humans are capable of doing that just because we have learned that something like 2010-03-31 resembles a date. In other words, I would suggest using machine learning algorithms and teaching your program to recognize valid date sequences. With the Google Prediction API that should be feasible.
Or you can use Regular Expressions as suggested above, to detect some but not all date formats.
Use JChronic
You may want to use DateParser2 from edu.mit.broad.genome.utils package.