Is it faster to use alternation than subsequent replacements in regular expressions

耶瑟儿~ 2021-01-05 17:27

I have quite a straightforward question. Where I work, I see a lot of regular expressions come by. They are used in Perl to replace and/or get rid of some strings in text

6 answers
  •  不知归路
    2021-01-05 17:31

    How regex alternation is implemented in Perl is fairly well explained in perldoc perlretut:

    Matching this or that

    We can match different character strings with the alternation metacharacter '|'. To match dog or cat, we form the regex dog|cat. As before, Perl will try to match the regex at the earliest possible point in the string. At each character position, Perl will first try to match the first alternative, dog. If dog doesn't match, Perl will then try the next alternative, cat. If cat doesn't match either, then the match fails and Perl moves to the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat" 
    

    Even though dog is the first alternative in the second regex, cat is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats" 
    

    Here, all the alternatives match at the first string position, so the first alternative is the one that matches. If some of the alternatives are truncations of the others, put the longest ones first to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/ 
    

    The last example points out that character classes are like alternations of characters. At a given character position, the first alternative that allows the regexp match to succeed will be the one that matches.
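The quoted examples can be checked directly. This is a minimal sketch: `matched_text` is a hypothetical helper introduced here for illustration, not something from perldoc. It wraps the pattern in a capture group and returns whatever matched.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that $regex matched in $string,
# or undef when there is no match.
sub matched_text {
    my ($string, $regex) = @_;
    return $string =~ /($regex)/ ? $1 : undef;
}

# The earliest position in the string wins, regardless of alternative order.
print matched_text("cats and dogs", qr/dog|cat|bird/), "\n";   # cat

# At the same position, the first alternative that succeeds wins.
print matched_text("cats", qr/c|ca|cat|cats/), "\n";           # c
print matched_text("cats", qr/cats|cat|ca|c/), "\n";           # cats

# A single-character alternation behaves like the class [abc].
print matched_text("cab", qr/a|b|c/), "\n";                    # c
print matched_text("cab", qr/[abc]/), "\n";                    # c
```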

    This should explain the price you pay when using alternation in a regex: at each position, the engine may have to try every alternative in turn before moving on.

    When you chain simple regexes instead, you don't pay such a price. It's well explained in another related question on SO. When searching directly for a constant string, or a set of characters as in the question, optimizations can be applied and no backtracking is needed, which means potentially faster code.
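To make the two styles concrete, here is a minimal sketch of both. The helper names and the literal patterns `foo` and `bar` are placeholders assumed for illustration; the question does not say which strings are being removed.

```perl
use strict;
use warnings;

# Hypothetical helper: two subsequent replacements, one constant string each.
sub strip_sequential {
    my ($text) = @_;
    $text =~ s/foo//g;
    $text =~ s/bar//g;
    return $text;
}

# Hypothetical helper: a single replacement using alternation.
sub strip_alternation {
    my ($text) = @_;
    $text =~ s/foo|bar//g;
    return $text;
}

# On typical input both give the same result.
print strip_sequential("a foo b bar c"), "\n";
print strip_alternation("a foo b bar c"), "\n";

# They are not always interchangeable: removing "foo" from "bfooar"
# creates a new "bar", which only the second pass of the sequential
# version gets to see.
print strip_sequential("bfooar"), "\n";    # prints an empty line
print strip_alternation("bfooar"), "\n";   # bar
```

So before comparing speed, it is worth checking that the two forms mean the same thing for your input.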

    When defining the alternations, just choosing a good order (putting the most common matches first) can influence performance. Nor is choosing between two options the same as choosing between twenty. As always, premature optimization is the root of all evil, and you should profile your code (Devel::NYTProf) if there are problems or you want improvements. But as a general rule, alternations should be kept to a minimum and avoided if possible, since:

    • They easily make the regex too big and complex. We like regexes that are simple and easy to understand, debug and maintain.
    • Their cost is variable and input-dependent. They can be an unexpected source of problems, since they backtrack and can lead to an unexpected lack of performance depending on your input. As I understand it, there's no case in which they will be faster.
    • Conceptually, you are trying to match two different things, so we could argue that two separate statements are more correct and clearer than just one.
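Rather than taking the rule on faith, the comparison can be timed. The sketch below uses the core Benchmark module as a lightweight stand-in for the profiler mentioned above; the sample input, the placeholder patterns `foo` and `bar`, and the helper names are all assumptions made for illustration. Results depend on the Perl version and the input, so no expected numbers are given.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample input: a long line containing both placeholder strings.
my $input = join ' ', ('lorem foo ipsum bar dolor') x 1000;

# Hypothetical helper: two subsequent replacements.
sub strip_sequential {
    my ($text) = @_;
    $text =~ s/foo//g;
    $text =~ s/bar//g;
    return $text;
}

# Hypothetical helper: one replacement with alternation.
sub strip_alternation {
    my ($text) = @_;
    $text =~ s/foo|bar//g;
    return $text;
}

# A negative count asks Benchmark to run each case for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    sequential  => sub { strip_sequential($input) },
    alternation => sub { strip_alternation($input) },
});
```

For a whole program rather than a micro-benchmark, Devel::NYTProf is typically run as `perl -d:NYTProf script.pl` followed by `nytprofhtml`, where `script.pl` stands for your own script.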

    Hope this answer gets closer to what you were expecting.
