Convert Perl regular expression to equivalent ECMAScript regular expression

滥情空心 2021-01-18 23:53

I'm now using VC++ 2010, but its syntax_option_type contains only the following options:

static const flag_type icase = regex_cons         


        
1 answer
  •  南方客
    南方客 (original poster)
    2021-01-19 00:22

    For the particular regex you want to convert, the equivalent in ECMA regex is:

    /^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$/
    

    In this case, \A (in Perl regex) has the same meaning as ^ (in ECMA regex): both match the beginning of the string. \Z (in Perl regex) corresponds to $ (in ECMA regex), which matches the end of the string. The one difference is that Perl's \Z also matches just before a trailing newline, while $ in ECMA regex (without multiline mode) matches only at the very end. Note that ^ and $ in ECMA regex change to match the beginning and the end of each line if you enable multiline mode.
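As a minimal sketch of how the converted pattern could be used from C++, the snippet below feeds it to std::regex with the ECMAScript grammar. The function names matches_number and first_group are hypothetical, and the code assumes a C++11 compiler with raw string literals; on older compilers that lack them, write the pattern as an ordinary string with every backslash doubled.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the compiled pattern. ECMAScript is already the default grammar;
// it is spelled out here only to make the choice explicit.
const std::regex& number_pattern() {
    static const std::regex pattern(
        R"(^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$)",
        std::regex_constants::ECMAScript);
    return pattern;
}

// Hypothetical helper: true when the whole input is four digit groups
// (3-4, 4, 4 and 4 digits), each optionally separated by '-' or ' '.
bool matches_number(const std::string& input) {
    // regex_match only succeeds if the entire input matches, so ^ and $
    // are redundant here, but they are harmless.
    return std::regex_match(input, number_pattern());
}

// Hypothetical helper: returns the first captured group, or an empty
// string when the input does not match.
std::string first_group(const std::string& input) {
    std::smatch m;
    if (std::regex_match(input, m, number_pattern())) {
        return m[1].str();
    }
    return std::string();
}
```

Because std::regex_match anchors the match at both ends by itself, the same pattern without ^ and $ should behave identically in this helper; the anchors matter only if you switch to std::regex_search.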

    ECMA regex is a subset of Perl regex, so if the regex uses features exclusive to Perl regex, it is likely not convertible to ECMA regex. Even where the syntax is the same, it may mean slightly different things in the two dialects, so it is always wise to check the documentation and compare the usage.

    I'm only going to cover what is similar between ECMA regex and Perl regex. Where something is not similar but is convertible, I will mention it to the best of my ability.

    ECMA regex lacks features for working with Unicode, which forces you to look up the code points and specify them in character classes.

    Going through the documentation for Perl regular expressions, section by section:

    • Modifiers:
      • Only i, g, m are in the ECMA Standard, and they behave the same as in Perl.
      • The s (dot-all) modifier can be simulated in ECMA regex by using two complementary character classes, e.g. [\S\s], [\D\d].
      • There is no support in any way for the x and p flags.
      • I don't know if there is any way to simulate the rest (prefix and suffix modifiers).
    • Meta characters:
      • I am not sure how ECMA regex treats \ in front of a non-meta character that has no special meaning, but you should be fine as long as you don't escape where you don't need to. . in ECMA regex excludes a few more characters than in Perl (all line terminators, not just \n). The rest behave the same in ECMA regex (including the effect of the m flag on ^ and $).
    • Quantifier:
      • Greedy and lazy quantifiers should behave the same. There are no possessive quantifiers in ECMA regex.
    • Escape sequences:
      • There's no \a and \e in ECMA regex. \t, \n, \r, \f are the same.
      • Check the documentation if the regex has \cX - there are differences.
      • \xhh is common to ECMA regex and Perl regex (specifying 2 hexadecimal digits is the safest - otherwise, you will have to look up the documentation to see how the language deals with the case where there are fewer than 2 hexadecimal digits).
      • \uhhhh is an ECMA regex exclusive feature for specifying a Unicode character. Perl has its own exclusive ways to specify a character, such as \x{}, \N{}, \o{}, \000.
      • \l, \u, \L, \U are exclusive to Perl regex.
      • \Q and \E can be simulated by escaping the quoted section by hand.
      • An octal escape with fewer than 3 octal digits in Perl regex may be confusing. Check the context carefully, read the documentation, and/or test the regex to make sure you understand what it is doing in context, since it might be either an escape sequence or a back reference.
    • Character classes and other special escapes:
      • \w, \W, \s, \S, \d, \D are equivalent in ECMA regex and Perl regex, assuming US-ASCII. If Unicode is involved, things will be a bloody mess.
      • There are no POSIX character classes in ECMA regex. Use the above \w, \s, \d or spell out the characters yourself in a character class.
      • Back references are mostly the same - but I don't know whether Perl and ECMA regex allow a back reference to go beyond 9.
      • Named references can be simulated with numbered back references.
      • The rest (except [] and the escape sequences already mentioned) are unsupported in ECMA regex.
    • Assertion:
      • \b and \B are equivalent in both languages, with regard to how they are defined based on \w.
    • Capture groups: Grouping () and back references are the same. $n, which is used in the replacement string to refer back to matched text, is the same. The rest of that section covers Perl exclusive features.
    • Quoting meta-characters: (Content already mentioned in previous sections).
    • Extended Pattern:
      • ECMA regex doesn't support modifying flags inside the regex. Depending on what the flags are, you may be able to rewrite the regex (the s flag is one that can always be converted to an equivalent expression in ECMA regex).
      • Only (?:pattern) (non-capturing group), (?=pattern) (positive look-ahead), (?!pattern) (negative look-ahead) are common to Perl and ECMA.
      • There are no comments in ECMA regex, so (?#text) can simply be dropped.
      • Look-behinds are not supported in ECMA regex. Fixed-width look-behind is supported in Perl. In some cases, a regex with a positive look-behind written in Perl can be converted to ECMA regex by making the look-behind a capturing group.
      • As mentioned before, a named pattern can be converted to a normal capture group and referred to with a numbered back reference.
      • The rest are Perl exclusive features.
    • Special Backtracking Control Verbs: These are Perl exclusive, and I have no idea what they do (I have never touched them), let alone how to convert them. Most likely they are not convertible anyway.
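Three of the conversions listed above can be sketched in C++ with std::regex (ECMAScript grammar) as follows. This is an illustration under assumptions, not a general converter: the patterns (foo...bar, the "USD " prefix) and the function names are hypothetical, and the code assumes a C++11 compiler. For the look-behind, this variant turns the look-behind into ordinary pattern text and captures the part after it, which is a small variation on capturing the look-behind itself.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Perl /foo.*bar/s  ->  ECMAScript foo[\s\S]*bar
// [\s\S] matches any character, including the line terminators that '.' skips.
bool spans_lines(const std::string& text) {
    static const std::regex pattern(R"(foo[\s\S]*bar)");
    return std::regex_search(text, pattern);
}

// Perl /(?<=USD )\d+/  ->  ECMAScript USD (\d+)
// The look-behind text is matched normally; the wanted digits are group 1.
std::string amount_after_usd(const std::string& text) {
    static const std::regex pattern(R"(USD (\d+))");
    std::smatch m;
    if (std::regex_search(text, m, pattern)) {
        return m[1].str();
    }
    return std::string();
}

// Perl \Q...\E  ->  escape each regex metacharacter by hand.
std::string quote_meta(const std::string& literal) {
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : literal) {
        if (meta.find(c) != std::string::npos) {
            out += '\\';  // prefix metacharacters with a backslash
        }
        out += c;
    }
    return out;
}
```

The output of quote_meta can be concatenated into a larger pattern wherever the Perl original had a \Q...\E section.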

    Conclusion:

    If the regex utilizes the full power of Perl regex, or features at the level the Boost library supports (e.g. recursive regex), it is not possible to convert it to ECMA regex. Fortunately, ECMA regex covers the most commonly used features, so it is likely that a given regex is convertible.

    Reference:

    ECMA RegExp Reference on MDN
