I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.
Using Pyth
Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):
import re
years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years  # month, day, then the year group
thirties = pattern % (
    "September|April|June|November",
    r'0?[1-9]|[12]\d|30')
thirtyones = pattern % (
    "January|March|May|July|August|October|December",
    r'0?[1-9]|[12]\d|3[01]')
# Two-digit multiples of four: 04, 08, ..., 96
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))
feb = r'(February) +(?:%s|%s)' % (
    r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years,  # 1-28 any year
    r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours)  # 29 leap years only
result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)
Then we have:
>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False
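One caveat about the checks above: match() only anchors at the start of the string, so trailing characters are ignored. Here is a minimal sketch of the difference, assuming Python 3.4+ for fullmatch() and using a simplified January-only stand-in pattern (my own, for illustration, not the full regex):

```python
import re

# Simplified stand-in pattern (an assumption for illustration only).
simple = re.compile(r'(January) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d)')

# match() anchors only at the start, so the stray trailing "5" slips through.
loose = simple.match('January 30, 20015') is not None

# fullmatch() requires the whole string to match the pattern.
strict = simple.fullmatch('January 30, 20015') is not None
exact = simple.fullmatch('January 30, 2001') is not None

print(loose, strict, exact)
```

If you need to stay with match(), appending $ to the pattern should give a similar effect.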
And what is this glorious regexp, you may ask?
>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:(February) +(?:(?:(0?[1-9]|1\d|2[0-8])), *((?:19|20)\d\d)|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))
(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
Here are some quick thoughts:
Everyone who is suggesting you use something other than regular expressions is giving you very good advice. On the other hand, it's always a good time to learn more about regular expression syntax...
An expression in square brackets, [...], matches any single character inside those brackets. So writing [,], which contains only a single character, is exactly identical to writing a plain, unadorned comma.
The .findall method returns a list of matches; when the pattern contains groups, each entry holds the captured groups rather than the whole match. A group is identified by parentheses, (...), and groups count from left to right, outermost first. Your final expression looks like this:
((19|20)[0-9][0-9])
The outermost parentheses match the entire year, and the inner parentheses match the first two digits. Hence, for a date like "1989", the final two match groups are going to be 1989 and 19.
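To see the nested groups in action, here is a small sketch (the sample sentence is made up for illustration):

```python
import re

text = "Published in 1989, revised in 2004."

# Two capturing groups: the outer one (whole year) and the inner one (century).
# With more than one group, findall returns one tuple per match.
matches = re.findall(r'((19|20)[0-9][0-9])', text)
print(matches)
```

Each tuple lists the groups in the order their opening parentheses appear, so the whole year comes first and the two-digit prefix second.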
First of all, as others have said, I don't think regular expressions are the best choice for this problem, but to answer your question: by using parentheses you split the string into several subgroups, and when you call findall you get back a list containing every group you created for each match.
((19|20)[0-9][0-9])
Here is your problem: the regex captures both the entire year and 19 or 20, depending on whether the year starts with 19 or 20.
You have this regular expression:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
One feature of regular expressions is a "character class". Characters in square brackets make a character class. Thus [,] is a character class matching a single character, a comma. You might as well just put the comma.
Perhaps you wanted to make the comma optional? You can do that by putting a question mark after it: ,?
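A quick sketch of both points, using a made-up "May 5" pattern purely for illustration (fullmatch assumes Python 3.4+):

```python
import re

# [,] and a bare comma match exactly the same thing.
same = bool(re.fullmatch(r'[,]', ',')) and bool(re.fullmatch(r',', ','))

# ,? makes the comma optional: zero or one comma is accepted.
optional = re.compile(r'May,? 5')
with_comma = optional.fullmatch('May, 5') is not None
without_comma = optional.fullmatch('May 5') is not None
two_commas = optional.fullmatch('May,, 5') is not None

print(same, with_comma, without_comma, two_commas)
```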
Anything you put into parentheses makes a "match group". I think the mysterious extra "19" came from a match group you didn't mean to have. You can make a non-capturing group using this syntax: (?:...)
So, for example:
r'(?:red|blue) socks'
This would match "red socks" or "blue socks" but does not make a match group. If you then put that inside plain parentheses:
r'((?:red|blue) socks)'
That would make a match group, whose value would be "red socks" or "blue socks".
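Here is how the sock patterns behave with findall, as a small sketch (the sample text is made up):

```python
import re

text = "red socks and blue socks"

# No capturing group: findall returns the whole match.
no_group = re.findall(r'(?:red|blue) socks', text)

# One capturing group around everything: same strings, now via the group.
grouped = re.findall(r'((?:red|blue) socks)', text)

# Capturing only the colour: findall returns just the group contents.
colour_only = re.findall(r'(red|blue) socks', text)

print(no_group, grouped, colour_only)
```

The last case is the same effect that produced the stray "19": findall reports what the groups captured, not the full match.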
I think if you apply these comments to your regular expression, it will work. It is mostly correct now.
As for validating the day against the month, that is well beyond what a regular expression does comfortably. Your pattern will match "February 31" and there is no easy way to fix that.
Python has a date parser as part of the time module:
import time
time.strptime("December 31, 2012", "%B %d, %Y")
The above is all you need if the date format is always the same.
So, in real production code, I would write a regular expression that parses the date loosely, then use the pieces it captured to build a date string that is always in the same format, and hand that string to strptime.
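A minimal sketch of that two-step approach follows. The LOOSE_DATE pattern and the parse_date helper are hypothetical names of my own, and the code assumes Python 3.4+ and an English locale (since %B month names are locale-dependent):

```python
import re
import time

# Hypothetical loose pattern: optional commas, any whitespace, 1- or 2-digit day.
LOOSE_DATE = re.compile(r'([A-Za-z]+),?\s+(\d{1,2}),?\s+(\d{4})')

def parse_date(text):
    """Return a time.struct_time, or None if text is not a real calendar date."""
    m = LOOSE_DATE.fullmatch(text.strip())
    if m is None:
        return None
    month, day, year = m.groups()
    # Rebuild the date in one fixed format so strptime sees a single layout.
    canonical = '%s %s, %s' % (month, day, year)
    try:
        # strptime rejects impossible dates such as February 31.
        return time.strptime(canonical, '%B %d, %Y')
    except ValueError:
        return None

print(parse_date('December 31 2012'))
print(parse_date('February 31, 2012'))
```

The regex only handles the formatting variations; the calendar rules (month lengths, leap years) are left to strptime.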
Now that you have said in the comments that this is homework, I'll post another answer with tips on regular expressions.
A group is identified by parentheses, (...), and groups count from left to right, outermost first. Your final expression looks like this:
((19|20)[0-9][0-9])
The outermost parentheses match the entire year, and the inner parentheses match the first two digits. Hence, for a date like "1989", the two match groups are going to be 1989 and 19. Since you don't want the inner group (first two digits), you should use a non-capturing group instead. Non-capturing groups start with ?: and are used like this: (?:a|b|c)
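Applied to the date pattern from the question, that might look like the sketch below. Making the comma after the month optional is my own assumption about the intended format, and the sample sentence is made up:

```python
import re

months = ('January|February|March|April|May|June|July|'
          'August|September|October|November|December')

# (?:19|20) groups the alternatives without creating an extra match group.
pattern = r'(%s),? (0[1-9]|[12][0-9]|3[01]), ((?:19|20)[0-9][0-9])' % months

text = 'Born December 31, 1989 and moved on January 05, 2004.'
dates = re.findall(pattern, text)
print(dates)
```

Each match now yields exactly three groups: month, day, and the full year.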
By the way, there is some good documentation on how to use regular expressions here.