How do you match only valid Roman numerals with a regular expression?

無奈伤痛 2020-11-22 02:44

Thinking about my other problem, I decided I can't even create a regular expression that will match Roman numerals (let alone a context-free grammar that will generate them).

16 answers
  • 2020-11-22 03:12

    I've seen multiple answers that either don't handle empty strings or rely on lookaheads to do so. I want to add an answer that rejects empty strings without using a lookahead. The regex is the following:

    ^(?:(I[VX]|VI{0,3}|I{1,3})|((X[LC]|LX{0,3}|X{1,3})(I[VX]|V?I{0,3}))|((C[DM]|DC{0,3}|C{1,3})(X[LC]|L?X{0,3})(I[VX]|V?I{0,3}))|(M+(C[DM]|D?C{0,3})(X[LC]|L?X{0,3})(I[VX]|V?I{0,3})))$

    I'm allowing any number of leading M's with M+, but of course someone could change it to M{1,4} to allow only 1 to 4 if desired.

    Two online demos (Debuggex Demo and Regex 101 Demo) accompany this answer, together with a visualization that helps to understand what the regex is doing.

    [Image: regular expression visualization]
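
    A minimal Python sketch for trying this pattern out, assuming the four alternatives are wrapped in one non-capturing group so that ^ and $ apply to every branch. The name is_roman and the sample inputs are illustrative, not part of the answer.

```python
import re

# Assumption: one outer (?:...) group, so ^ and $ constrain all four
# alternatives rather than only the first and the last.
ROMAN = re.compile(
    r"^(?:(I[VX]|VI{0,3}|I{1,3})"
    r"|((X[LC]|LX{0,3}|X{1,3})(I[VX]|V?I{0,3}))"
    r"|((C[DM]|DC{0,3}|C{1,3})(X[LC]|L?X{0,3})(I[VX]|V?I{0,3}))"
    r"|(M+(C[DM]|D?C{0,3})(X[LC]|L?X{0,3})(I[VX]|V?I{0,3})))$"
)


def is_roman(text):
    """Return True if text matches the pattern above."""
    return ROMAN.match(text) is not None


# Print each sample input next to the verdict.
for candidate in ["", "IV", "XLII", "MCMXCIV", "IIII", "XXXX", "VX"]:
    print(repr(candidate), is_roman(candidate))
```

    Every alternative requires at least one letter, which is how the empty string gets rejected without any lookahead.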

  • 2020-11-22 03:17

    The positive look-behind and look-ahead suggested by @paxdiablo to avoid matching empty strings does not seem to work for me.

    I fixed it by using a negative look-ahead instead:

    (?!$)M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})
    

    NB: if you append something (e.g. "foobar") at the end of the regex, then obviously you'll have to replace (?!$) with (?!f) (where f is the first character of "foobar").
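
    A rough way to check this in Python, as a sketch only: the pattern above carries no ^ or $, so re.fullmatch is used here to force a whole-string match. That choice, and the helper name is_roman, are assumptions rather than part of the answer.

```python
import re

# Pattern from the answer; (?!$) fails when the match would start at the
# very end of the input, which is the only way the rest could match nothing.
NON_EMPTY_ROMAN = re.compile(
    r"(?!$)M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_roman(text):
    """Whole-string check; fullmatch stands in for explicit anchors."""
    return NON_EMPTY_ROMAN.fullmatch(text) is not None


for candidate in ["", "MMXX", "MMMM", "IC"]:
    print(repr(candidate), is_roman(candidate))
```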

  • 2020-11-22 03:18
    import re

    # Thousands, hundreds, tens, units; anchored so the whole string must match.
    pattern = r'^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
    if re.search(pattern, 'XCCMCI'):
        print('Valid Roman')
    else:
        print('Not valid Roman')
    

    For people who really want to understand the logic, please take a look at the step-by-step explanation, spread over 3 pages, on diveintopython.

    The only difference from the original solution (which had M{0,4}) is that I found 'MMMM' is not a valid Roman numeral (also, the old Romans most probably never thought about a number that huge and would disagree with me). If you are one of the disagreeing old Romans, please forgive me and use the {0,4} version.

  • 2020-11-22 03:19

    I'm answering the question "Regular Expression in Python for Roman Numerals" here
    because it was marked as an exact duplicate of this question.

    It might be similar in name, but it is a specific regex question / problem,
    as can be seen from this answer to that question.

    The items being sought can be combined into a single alternation and then
    encased inside a capture group that will be put into a list with the findall()
    function.
    It is done like this:

    >>> import re
    >>> target = (
    ... r"this should pass v" + "\n"
    ... r"this is a test iii" + "\n"
    ... )
    >>>
    >>> re.findall( r"(?m)\s(i{1,3}v*|v)$", target )
    ['v', 'iii']
    

    The regex, factored to capture just the numerals, expands to this:

     (?m)
     \s 
     (                     # (1 start)
          i{1,3} 
          v* 
       |  v
     )                     # (1 end)
     $
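
    The expanded layout above can be fed to Python more or less as-is with re.VERBOSE. This is a sketch under one assumption: the flags are passed as arguments instead of the inline (?m), because recent Python versions require inline global flags at the very start of the pattern. The name NUMERAL_AT_EOL is made up for the example.

```python
import re

# Same pattern as above, written with whitespace and comments.
NUMERAL_AT_EOL = re.compile(
    r"""
    \s              # whitespace before the numeral
    (               # (1 start)
        i{1,3}
        v*
      | v
    )               # (1 end)
    $               # end of each line, because of re.MULTILINE
    """,
    re.MULTILINE | re.VERBOSE,
)

target = "this should pass v\nthis is a test iii\n"
print(NUMERAL_AT_EOL.findall(target))
```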
    
  • 2020-11-22 03:21

    You can use the following regex for this:

    ^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
    

    Breaking it down, M{0,4} specifies the thousands section and basically restrains it to between 0 and 4000. It's relatively simple:

       0: <empty>  matched by M{0}
    1000: M        matched by M{1}
    2000: MM       matched by M{2}
    3000: MMM      matched by M{3}
    4000: MMMM     matched by M{4}
    

    You could, of course, use something like M* to allow any number (including zero) of thousands, if you want to allow bigger numbers.

    Next is (CM|CD|D?C{0,3}), which is slightly more complex. It handles the hundreds section and covers all the possibilities:

      0: <empty>  matched by D?C{0} (with D not there)
    100: C        matched by D?C{1} (with D not there)
    200: CC       matched by D?C{2} (with D not there)
    300: CCC      matched by D?C{3} (with D not there)
    400: CD       matched by CD
    500: D        matched by D?C{0} (with D there)
    600: DC       matched by D?C{1} (with D there)
    700: DCC      matched by D?C{2} (with D there)
    800: DCCC     matched by D?C{3} (with D there)
    900: CM       matched by CM
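
    The hundreds table can be checked mechanically. The sketch below tests the group on its own with re.fullmatch; the helper matches_hundreds and the list of invalid samples are illustrative assumptions.

```python
import re

# The hundreds group in isolation.
HUNDREDS = re.compile(r"(CM|CD|D?C{0,3})")

# Index i holds the numeral for i * 100, mirroring the table above.
EXPECTED = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]


def matches_hundreds(text):
    """True if text is exactly one valid hundreds digit (or empty)."""
    return HUNDREDS.fullmatch(text) is not None


for digit, numeral in enumerate(EXPECTED):
    print(f"{digit * 100:>3}: {numeral or '<empty>':<8} {matches_hundreds(numeral)}")

# Strings that look plausible but are not valid hundreds digits.
for bad in ["CCCC", "DD", "DCD", "CMC"]:
    print(f"{bad:<13} {matches_hundreds(bad)}")
```

    The tens and units groups have the same shape, so the same check applies after swapping in the corresponding letters.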
    

    Thirdly, (XC|XL|L?X{0,3}) follows the same rules as the previous section but for the tens place:

     0: <empty>  matched by L?X{0} (with L not there)
    10: X        matched by L?X{1} (with L not there)
    20: XX       matched by L?X{2} (with L not there)
    30: XXX      matched by L?X{3} (with L not there)
    40: XL       matched by XL
    50: L        matched by L?X{0} (with L there)
    60: LX       matched by L?X{1} (with L there)
    70: LXX      matched by L?X{2} (with L there)
    80: LXXX     matched by L?X{3} (with L there)
    90: XC       matched by XC
    

    And, finally, (IX|IV|V?I{0,3}) is the units section, handling 0 through 9 and also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are):

    0: <empty>  matched by V?I{0} (with V not there)
    1: I        matched by V?I{1} (with V not there)
    2: II       matched by V?I{2} (with V not there)
    3: III      matched by V?I{3} (with V not there)
    4: IV       matched by IV
    5: V        matched by V?I{0} (with V there)
    6: VI       matched by V?I{1} (with V there)
    7: VII      matched by V?I{2} (with V there)
    8: VIII     matched by V?I{3} (with V there)
    9: IX       matched by IX
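
    To see the four sections working together, here is a small sketch that generates every numeral from 1 to 4999 and runs each one through the full regex. The to_roman converter is a hypothetical helper written for this check, not something from the answer.

```python
import re

ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

# Hypothetical helper data: value/symbol pairs in descending order.
SYMBOLS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Convert a positive integer to a numeral by greedy subtraction."""
    parts = []
    for value, symbol in SYMBOLS:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


# Every generated numeral in the supported range should match.
valid = [to_roman(n) for n in range(1, 5000)]
assert all(ROMAN.match(numeral) for numeral in valid)

# A few malformed strings should not.
for bad in ["IIII", "VX", "IC", "XM", "MMMMM"]:
    assert ROMAN.match(bad) is None

print("checked", len(valid), "numerals")
```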
    

    Just keep in mind that that regex will also match an empty string. If you don't want this (and your regex engine is modern enough), you can use positive look-behind and look-ahead:

    (?<=^)M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(?=$)
    

    (the other alternative being to just check that the length is not zero beforehand).
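
    The length-check alternative might look like the following in Python. This is a sketch; the function name is_roman is an assumption.

```python
import re

ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def is_roman(text):
    """Reject the empty string up front, then apply the regex."""
    return len(text) > 0 and ROMAN.match(text) is not None


print(ROMAN.match("") is not None)  # the bare regex accepts the empty string
print(is_roman(""))                 # the guarded check does not
print(is_roman("XIV"))
```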

  • 2020-11-22 03:21

    Just to save it here:

    (^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)
    

    Matches all the Roman numerals. Does not match empty strings (it requires at least one Roman numeral letter). Should work in PCRE, Perl, Python and Ruby.
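
    A quick Python check of this pattern, as a sketch with arbitrarily chosen sample inputs. The lookahead (?=[MDCLXVI]) requires that the first character be a numeral letter, which rules out the empty string.

```python
import re

ROMAN = re.compile(
    r"(^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)"
)


def is_roman(text):
    """True if text matches the lookahead-guarded pattern."""
    return ROMAN.match(text) is not None


for candidate in ["", "MMMMMM", "CDXLIV", "ABC", "IL"]:
    print(repr(candidate), is_roman(candidate))
```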

    Online Ruby demo: http://rubular.com/r/KLPR1zq3Hj

    Online Conversion: http://www.onlineconversion.com/roman_numerals_advanced.htm
