How can I match a quote-delimited string with a regex?

一生所求 2020-12-01 05:05

If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do somethi

9 Answers
  • 2020-12-01 05:59

    Using the negated character class stops the match at the first boundary character (a double quote, in your example), so it cannot run past a closing quote when more quotes appear later in the input.

    Your example #1:

    /"[^"]+"/ # match quote, then everything that's not a quote, then a quote

    matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.
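To make the "smallest pair" behaviour concrete, here is a minimal sketch using Python's `re` module as a stand-in for the Perl-style pattern above. The sample strings are made up for illustration. It also shows a side effect of `+` that is easy to miss: an empty quoted string is not matched unless you switch to `*`.

```python
import re

# Pattern from the answer: quote, one or more non-quote characters, quote.
QUOTED = re.compile(r'"[^"]+"')

# Hypothetical sample input with two separate quoted strings.
text = 'say "hello" and then "goodbye"'

# Each match is the smallest quote-to-quote span; it never runs across pairs.
matches = QUOTED.findall(text)
print(matches)  # ['"hello"', '"goodbye"']

# With +, an empty quoted string is skipped; with *, it is matched.
empty_with_plus = QUOTED.findall('empty: ""')
empty_with_star = re.findall(r'"[^"]*"', 'empty: ""')
print(empty_with_plus, empty_with_star)  # [] ['""']
```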

    Luckily Damian Conway comes to the rescue: Text::Balanced is there for you if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.

  • 2020-12-01 05:59

    I prefer the first regex, but it's certainly a matter of taste.

    The first one might be more efficient?

    Search for double-quote
    add double-quote to group
    for each char:
        if double-quote:
            break
        add to group
    add double-quote to group
    

    Vs something a bit more complicated involving back-tracking?
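The pseudocode above can be turned into a small hand-written scanner. This is only a sketch in Python of the loop as described, not how any real regex engine is implemented. The function name `match_quoted` is hypothetical, and the handling of a missing closing quote is an addition that the pseudocode leaves out.

```python
from typing import Optional


def match_quoted(text: str) -> Optional[str]:
    """Return the first double-quoted run in text, or None if there is none."""
    open_pos = text.find('"')            # search for double-quote
    if open_pos == -1:
        return None                      # no opening quote at all
    group = ['"']                        # add double-quote to group
    for ch in text[open_pos + 1:]:       # for each char after the opening quote
        if ch == '"':                    # if double-quote: stop scanning
            group.append('"')            # add closing double-quote to group
            return "".join(group)
        group.append(ch)                 # add to group
    return None                          # assumption: no closing quote means no match


print(match_quoted('x "one" "two"'))     # "one"
print(match_quoted('no close "abc'))     # None
```

The scanner looks at each character once and never revisits one, which is the intuition behind the guess that the negated-class version is the cheaper of the two.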

  • 2020-12-01 06:01

    You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:

    "[^"]*"!
    

    or:

    ".*?"!
    

    The difference appears when you have the subject:

    "one" "two"!
    

    The first regex matches:

    "two"!
    

    while the second regex matches:

    "one" "two"!
    

    Always be as specific as you can. Use the negated character class when you can.
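The difference is easy to reproduce. The following sketch runs both patterns against the subject from this answer, using Python's `re` module as a stand-in for whichever regex flavor you use:

```python
import re

subject = '"one" "two"!'

# Negated character class: the quoted part can never contain a quote.
negated = re.search(r'"[^"]*"!', subject)

# Lazy dot: keeps expanding past quotes until a quote followed by ! is found.
lazy = re.search(r'".*?"!', subject)

print(negated.group(0))  # "two"!
print(lazy.group(0))     # "one" "two"!
```

The lazy version starts at the first quote and is allowed to swallow the quotes in between, which is why it returns more than a single quoted string.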

    Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.
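A short sketch of the line-break behaviour, again in Python, where single line mode is spelled `re.DOTALL`. The sample string is made up for illustration:

```python
import re

# A quoted string that contains a line break.
text = '"first line\nsecond line"'

spans_lines = re.search(r'"[^"]*"', text)             # negated class crosses \n
dot_default = re.search(r'".*?"', text)               # . stops at \n by default
dot_dotall = re.search(r'".*?"', text, re.DOTALL)     # single line mode
no_newlines = re.search(r'"[^"\n]*"', text)           # line breaks excluded too

print(spans_lines is not None)   # True
print(dot_default is not None)   # False
print(dot_dotall is not None)    # True
print(no_newlines is not None)   # False
```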

    As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which they backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.
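If you want to check the speed claim on your own input, a rough timing sketch along these lines may help. It uses Python's `re` and `timeit`; the test string and the run count are arbitrary assumptions, and the numbers will vary by engine, input, and machine, so treat the output as an indication only.

```python
import re
import timeit

NEGATED = re.compile(r'"[^"]*"')
LAZY = re.compile(r'".*?"')

# Hypothetical input: 1000 short quoted words on one line.
text = '"word" ' * 1000

# Both patterns find the same strings here; only the matching work differs.
for name, pattern in (("negated class", NEGATED), ("lazy dot", LAZY)):
    seconds = timeit.timeit(lambda: pattern.findall(text), number=200)
    print(f"{name}: {seconds:.4f} s for 200 runs")
```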
