Thinking about my other problem, i decided I can\'t even create a regular expression that will match roman numerals (let alone a context-free grammar that will generate them
To avoid matching the empty string you'll need to repeat the pattern four times and replace each 0
with a 1
in turn, and account for V
, L
and D
:
(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
In this case (because this pattern uses ^
and $
) you would be better off checking for empty lines first and don't bother matching them. If you are using word boundaries then you don't have a problem because there's no such thing as an empty word. (At least regex doesn't define one; don't start philosophising, I'm being pragmatic here!)
In my own particular (real world) case I needed match numerals at word endings and I found no other way around it. I needed to scrub off the footnote numbers from my plain text document, where text such as "the Red Seacl and the Great Barrier Reefcli" had been converted to the Red Seacl and the Great Barrier Reefcli
. But I still had problems with valid words like Tahiti
and fantastic
are scrubbed into Tahit
and fantasti
.