How do you match only valid roman numerals with a regular expression?

Thinking about my other problem, I decided I can't even create a regular expression that will match Roman numerals (let alone a context-free grammar that will generate them).

16 answers

太阳男子

2020-11-22 03:22
To avoid matching the empty string you'll need to repeat the pattern four times and replace each 0 with a 1 in turn, and account for V, L and D:
```
(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))
```
In this case (because this pattern uses ^ and $) you would be better off checking for empty lines first and not bothering to match them. If you are using word boundaries then you don't have a problem, because there's no such thing as an empty word. (At least regex doesn't define one; don't start philosophising, I'm being pragmatic here!)
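As a rough illustration of the point about empty strings, the sketch below compares the short pattern with the four-way pattern above. The names `BASE`, `NON_EMPTY`, `is_roman_base` and `is_roman_non_empty` are made up for this example, and `re.fullmatch` is used here in place of explicit ^ and $ anchors.

```python
import re

# Short pattern: every group may match nothing, so "" is accepted.
BASE = r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"

# Four-way pattern from the answer: each alternative forces one
# position (thousands, hundreds, tens, units) to be non-empty.
NON_EMPTY = (
    r"(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))"
)


def is_roman_base(s):
    """True if the whole string matches the short pattern."""
    return re.fullmatch(BASE, s) is not None


def is_roman_non_empty(s):
    """True if the whole string matches the four-way pattern."""
    return re.fullmatch(NON_EMPTY, s) is not None


print(is_roman_base(""))           # True: the short pattern accepts ""
print(is_roman_non_empty(""))      # False
print(is_roman_non_empty("CDXLIV"))  # True
```

If the input is already known to be non-empty, the short pattern is enough and considerably easier to read.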

In my own particular (real-world) case I needed to match numerals at word endings, and I found no other way around it. I needed to scrub the footnote numbers from my plain-text document, where text such as "the Red Sea^cl and the Great Barrier Reef^cli" had been converted to "the Red Seacl and the Great Barrier Reefcli". But I still had the problem that valid words like Tahiti and fantastic were scrubbed into Tahit and fantasti.

旧时难觅i

2020-11-22 03:23

Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".

The Romans were far less concerned about the "rules" than your third-grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4, and "IIM" was completely cool for 998.

(If you have trouble dealing with that... Remember English spellings were not formalized until the 1700s. Until then, as long as the reader could figure it out, it was good enough).

北海茫月

2020-11-22 03:23

As Jeremy and Pax pointed out above ... '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$' should be the solution you're after ...

The specific URL that should have been attached (IMHO) is http://thehazeltree.org/diveintopython/7.html

Example 7.8 is the short form using {n,m}
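One way to gain some confidence in that pattern is to generate the standard numeral for every value it is meant to cover and check that each one is accepted. The following is only a sketch: `to_roman` and `SYMBOLS` are hypothetical helpers written for this check, not taken from the linked chapter.

```python
import re

ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

# Value/symbol pairs in descending order, including the subtractive forms.
SYMBOLS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n):
    """Hypothetical helper: standard numeral for 1 <= n <= 4999."""
    out = []
    for value, symbol in SYMBOLS:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


# Every standard numeral from 1 to 4999 should be accepted ...
all_accepted = all(ROMAN.match(to_roman(n)) for n in range(1, 5000))

# ... and these non-standard spellings should all be rejected.
wrongly_accepted = [s for s in ("XM", "IIII", "IIM", "VX", "MMMMM") if ROMAN.match(s)]

print(all_accepted, wrongly_accepted)
```

Note that this pattern still accepts the empty string, so an emptiness check is needed separately if that matters.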

隐瞒了意图╮

2020-11-22 03:23
Steven Levithan uses this regex in his post, which validates Roman numerals prior to "deromanizing" the value:
```
/^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/
```
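To show how validation and conversion can fit together, here is a minimal Python sketch of the same idea. It is not Levithan's code: `deromanize`, `VALID` and `VALUES` are names assumed for this example, and the JavaScript `/.../` delimiters are dropped because Python takes the pattern as a string.

```python
import re

# The pattern above as a Python raw string. M* is unbounded, so this
# version accepts any number of leading M's.
VALID = re.compile(r"^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$")

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def deromanize(numeral):
    """Hypothetical helper: integer value, or None if validation fails."""
    # The pattern alone would accept "", so reject it explicitly.
    if not numeral or not VALID.fullmatch(numeral):
        return None
    total = 0
    for i, ch in enumerate(numeral):
        value = VALUES[ch]
        # A smaller symbol directly before a larger one is subtracted
        # (IV, IX, XL, XC, CD, CM).
        if i + 1 < len(numeral) and value < VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


print(deromanize("MCMXCIV"))  # 1994
print(deromanize("XM"))       # None
```

Validating first matters here: the summing loop on its own would happily turn "XM" into 990.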

时光取名叫无心

2020-11-22 03:23
The following expression worked for me to validate Roman numerals.
```
^M{0,4}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$
```
Here,
- M{0,4} will match thousands
- C[MD]|D?C{0,3} will match hundreds
- X[CL]|L?X{0,3} will match tens
- I[XV]|V?I{0,3} will match units
Below is a visualization that helps to understand what it is doing, preceded by two online demos:

Debuggex Demo

Regex 101 Demo

Python Code:
```
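import re

# Raw string so the pattern reaches the regex engine unchanged.
regex = re.compile(r"^M{0,4}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$")
# match() returns a match object, or None if the string is not a valid numeral.
match = regex.match("MMMCMXCIX")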
```
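The thousands/hundreds/tens/units breakdown can also be made visible in the pattern itself. The sketch below rewrites the same expression with `re.VERBOSE` and named groups; the group names and the `ROMAN_VERBOSE` variable are choices made for this example only.

```python
import re

ROMAN_VERBOSE = re.compile(
    r"""
    ^
    (?P<thousands> M{0,4}           )  # 0 to 4 M's
    (?P<hundreds>  C[MD] | D?C{0,3} )  # CM, CD, or optional D plus 0-3 C's
    (?P<tens>      X[CL] | L?X{0,3} )  # XC, XL, or optional L plus 0-3 X's
    (?P<units>     I[XV] | V?I{0,3} )  # IX, IV, or optional V plus 0-3 I's
    $
    """,
    re.VERBOSE,
)

m = ROMAN_VERBOSE.match("MMMCMXCIX")
# Dictionary of the four named parts, or None when there is no match.
parts = m.groupdict() if m else None
print(parts)
# {'thousands': 'MMM', 'hundreds': 'CM', 'tens': 'XC', 'units': 'IX'}
```

In verbose mode, whitespace and `#` comments inside the pattern are ignored, so the layout does not change what is matched.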

梦谈多话

2020-11-22 03:24
In my case, I was trying to find all occurrences of Roman numerals inside the text and replace each with a single word, so I couldn't use the start and end of lines. Without those anchors, the @paxdiablo solution found many zero-length matches. I ended up with the following expression:
```
(?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})
```
My final Python code was like this:
```
import re
text = "RULES OF LIFE: I. STAY CURIOUS; II. NEVER STOP LEARNING"
text = re.sub(r'(?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})', 'ROMAN', text)
print(text)
```
Output:
```
RULES OF LIFE: ROMAN. STAY CURIOUS; ROMAN. NEVER STOP LEARNING
```
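Two limits of the expression above are worth knowing. The `{1,6}` in the lookahead skips numerals longer than six letters, and the body of the pattern may stop before the end of the word (in "VIM" it matches only "VI"). A possible variant, offered purely as an assumption and not as the answer author's method, replaces the word boundaries with `(?<!\w)` and `(?!\w)`; `WHOLE_WORD` and `replace_numerals` are hypothetical names.

```python
import re

# Assumed variant: (?<!\w) and (?!\w) require the match to span a whole
# word, and the lookahead (?=[MCDXLVI]) rules out the empty match.
WHOLE_WORD = re.compile(
    r"(?<!\w)(?=[MCDXLVI])"
    r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
    r"(?!\w)"
)


def replace_numerals(text, word="ROMAN"):
    """Replace each whole-word Roman numeral in text with word."""
    return WHOLE_WORD.sub(word, text)


text = "RULES OF LIFE: I. STAY CURIOUS; II. NEVER STOP LEARNING"
print(replace_numerals(text))
# RULES OF LIFE: ROMAN. STAY CURIOUS; ROMAN. NEVER STOP LEARNING
```

Ordinary words that happen to be well-formed numerals, such as "I" or "MIX", are still replaced; no regular expression can tell those apart without more context.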