Regex that can match an empty string is breaking the JavaScript regex engine

暗喜 2020-11-27 08:29

I wrote the following regex: /\D(?!.*\D)|^-?|\d+/g

I think it should work this way:

\D(?!.*\D)    # match the last non-digit
|              


        
2 Answers
  • 2020-11-27 09:18

    JS works differently than PCRE. The JS regex engine does not handle zero-length matches well: after a zero-length match the index is simply incremented by one, so the character at that position is skipped. Here, ^-? can match an empty string; it matches at the start of 12,345,678.90, and the 1 is then skipped.
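
    To make the symptom concrete, here is a minimal sketch that runs the regex from the question against the sample input. The variable name `matches` is only for illustration.

    ```javascript
    // The regex from the question, applied to the sample input.
    var matches = '12,345,678.90'.match(/\D(?!.*\D)|^-?|\d+/g);

    // The empty match of ^-? at index 0 comes first; matching then resumes at
    // index 1, so the leading "1" never becomes part of any match.
    console.log(matches);
    // => ["", "2", "345", "678", ".", "90"]
    ```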

    If we have a look at the specification of String#match (the steps below are quoted from ECMAScript 5), we will see that, when match is called with a global regex, the regex object's lastIndex is incremented by one after a zero-length match is found:

    8. Else, global is true
      a. Call the [[Put]] internal method of rx with arguments "lastIndex" and 0.
      b. Let A be a new array created as if by the expression new Array() where Array is the standard built-in constructor with that name.
      c. Let previousLastIndex be 0.
      d. Let n be 0.
      e. Let lastMatch be true.
      f. Repeat, while lastMatch is true
          i. Let result be the result of calling the [[Call]] internal method of exec with rx as the this value and argument list containing S.
          ii. If result is null, then set lastMatch to false.
          iii. Else, result is not null
              1. Let thisIndex be the result of calling the [[Get]] internal method of rx with argument "lastIndex".
              2. If thisIndex = previousLastIndex then
                  a. Call the [[Put]] internal method of rx with arguments "lastIndex" and thisIndex+1.
                  b. Set previousLastIndex to thisIndex+1.

    So, the matching process goes through 8a to 8e, initializing the auxiliary structures, and then enters the loop at 8f, which repeats while lastMatch is true. The internal exec call matches the empty string at the start of the input (8fi -> 8fiii). As the result is not null, thisIndex is read from the regex's lastIndex, and because the match was zero-length, lastIndex has not advanced (basically, thisIndex = previousLastIndex). Both lastIndex and previousLastIndex are therefore set to thisIndex+1, which skips the current position after a successful zero-length match.
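
    The loop can be approximated in plain JavaScript. This is only a sketch, not the engine's actual implementation: `matchGlobal` is a hypothetical helper name, and the else branch and the collection of matches follow the remainder of step 8, which is not quoted above.

    ```javascript
    // Hypothetical emulation of step 8 of String#match for a global regex.
    function matchGlobal(rx, S) {
      rx.lastIndex = 0;                         // 8a
      var A = [];                               // 8b
      var previousLastIndex = 0;                // 8c
      var result;
      while ((result = rx.exec(S)) !== null) {  // 8f: repeat until exec returns null
        var thisIndex = rx.lastIndex;           // 8f.iii.1
        if (thisIndex === previousLastIndex) {  // zero-length match: lastIndex did not advance
          rx.lastIndex = thisIndex + 1;         // step past the current position
          previousLastIndex = thisIndex + 1;
        } else {
          previousLastIndex = thisIndex;        // from the unquoted rest of the step
        }
        A.push(result[0]);                      // collect the matched text
      }
      return A.length === 0 ? null : A;
    }

    var emulated = matchGlobal(/\D(?!.*\D)|^-?|\d+/g, '12,345,678.90');
    console.log(emulated);
    // => ["", "2", "345", "678", ".", "90"]
    ```

    The emulation gives the same array as the built-in match call for this input, which is consistent with the explanation above.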

    You may actually use a simpler regex inside a replace method and use a callback to use appropriate replacements:

    // Group 1 captures only the last non-digit; every other match is deleted.
    var res = '-12,345,678.90'.replace(/(\D)(?!.*\D)|^-|\D/g, function($0,$1) {
       return $1 ? "." : "";
    });
    console.log(res); // prints "12345678.90"

    Pattern details:

    • (\D)(?!.*\D) - a non-digit (captured into Group 1) that is not followed by 0+ chars other than a newline and then another non-digit, i.e. the last non-digit on the line
    • | - or
    • ^- - a hyphen at the string start
    • | - or
    • \D - a non-digit

    Note that here you do not even have to make the hyphen at the start optional.
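
    With the callback above, the leading hyphen is replaced with an empty string as well. If the sign should be kept, which is an assumption about the desired output, one possible variant is sketched below. `normalizeNumber` is a hypothetical helper name, and ^- is moved to the front so that a lone leading hyphen is not mistaken for the last non-digit.

    ```javascript
    // Hypothetical helper: strip grouping separators, turn the last separator
    // into a dot, and keep a leading minus sign.
    function normalizeNumber(s) {
      return s.replace(/^-|(\D)(?!.*\D)|\D/g, function (m, lastSep, offset) {
        if (offset === 0 && m === '-') return '-'; // keep the sign
        return lastSep ? '.' : '';                 // last non-digit -> ".", others removed
      });
    }

    console.log(normalizeNumber('-12,345,678.90')); // "-12345678.90"
    console.log(normalizeNumber('1.234,56'));       // "1234.56"
    ```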

  • 2020-11-27 09:24

    You can reorder your alternation patterns and use this in JS to make it work:

    var arrTest = '12,345,678.90'.match(/\D(?!.*\D)|\d+|^-?/g);
    console.log(arrTest);
    
    var test = arrTest.join('').replace(/\D/, '.');
    
    console.log(test);
    
    //=> 12345678.90


    This comes down to a difference between JavaScript and PHP (PCRE) regex behavior.

    In Javascript:

    '12345'.match(/^|.+/gm)
    //=> ["", "2345"]
    

    In PHP:

    preg_match_all('/^|.+/m', '12345', $m);
    print_r($m);
    // Output:
    Array
    (
        [0] => Array
            (
                [0] =>
                [1] => 12345
            )

    )
    

    So when ^ matches in JavaScript, the regex engine moves one position ahead, and anything after the alternation | can only match from the 2nd position onwards in the input.
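
    The skipped position can also be seen by printing where each match starts. The sketch below uses String.prototype.matchAll (ES2020), which is not used in the original answer; `positions` is just an illustrative name.

    ```javascript
    // Record the text and start index of every match.
    var positions = Array.from('12345'.matchAll(/^|.+/gm), function (m) {
      return { text: m[0], index: m.index };
    });

    console.log(positions);
    // => [ { text: "", index: 0 }, { text: "2345", index: 1 } ]
    ```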
