When is an issue too complex for a regular expression?

执念已碎 2021-02-01 19:42

Please don't answer the obvious, but what are the warning signs that tell us a problem should not be solved using regular expressions?

For example: Why is a complete emai

13 answers
  • 2021-02-01 20:14

    Here's a good quote from Raymond Chen:

    Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."

    Source

  • 2021-02-01 20:17

    Besides unwieldy expressions, there are fundamental limits on the set of words a regexp can handle. For instance, you cannot write a regexp for words made of n characters a followed by n characters b, where n can be any number; more strictly, the language { a^n b^n : n >= 0 }.

    In many programming languages, regexps are an extension of regular languages, but matching time can become extremely large and such patterns are not portable.
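
    The a^n b^n case is easy once counting is done in ordinary code instead of inside the pattern. A minimal Python sketch (the function name is made up for illustration):

```python
def is_an_bn(word: str) -> bool:
    """Return True if word is n 'a' characters followed by exactly n 'b' characters."""
    n = len(word) // 2
    # The length must be even, and the word must equal n a's followed by n b's.
    return len(word) == 2 * n and word == "a" * n + "b" * n


print(is_an_bn("aaabbb"))  # True
print(is_an_bn("aaabb"))   # False
```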

  • 2021-02-01 20:20

    A few things to look out for:

    1. beginning and ending tag detection -- matched pairing
    2. recursion
    3. needing to go backwards (though you can reverse the string, but that's a hack)

    Regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
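
    Matched pairing is the kind of job a small stack handles without trouble. A rough Python sketch, assuming brackets stand in for opening and closing tags (the names here are illustrative, not from any library):

```python
# Map each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text: str) -> bool:
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            # Opening bracket: remember it until its partner shows up.
            stack.append(ch)
        elif ch in PAIRS:
            # Closing bracket: it must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything left on the stack was never closed.
    return not stack


print(is_balanced("f(a[1], {b: (c)})"))  # True
print(is_balanced("(]"))                 # False
```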

  • 2021-02-01 20:20

    Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.

  • 2021-02-01 20:21

    What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression, then you either need to break it up into small, logical sub-regular expressions or you need to start rethinking your solution.

    Take email addresses (as per your example). This simple regular expression (taken from RegexBuddy) matches 99% of all email addresses out there:

    \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
    

    It is short and to the point and you will rarely run into issues with it. However, as the author of RegexBuddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
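
    A quick way to see that limitation is to run the simple pattern in Python. This is only a sketch: it assumes case-insensitive matching (the pattern itself lists only uppercase letters), and the helper name and sample addresses are made up.

```python
import re

# The simple pattern quoted above, compiled case-insensitively (an assumption).
SIMPLE_EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)


def looks_like_email(address: str) -> bool:
    """Return True if the whole string matches the simple pattern."""
    return SIMPLE_EMAIL.fullmatch(address) is not None


print(looks_like_email("jane.doe@example.com"))    # True
print(looks_like_email("curator@example.museum"))  # False: TLD is longer than 4 letters
```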

    To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.

    Here is a sample regular expression attempting to adhere to RFC 2822:

    (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
    (?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
    0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
    (?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
    {3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
    \x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
    

    This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.

    Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your regex solution starting to become extremely complex, you need to either break it up logically into smaller regular expressions that each match a portion of your text, or start looking at other methods to solve your problem. Similarly, there are problems that regular expressions, by their nature, simply can't solve (as another poster said, those that do not describe a regular language).
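
    As one possible illustration of breaking a match into smaller pieces, the Python sketch below splits an address at the last "@" and checks each part with a short pattern. It is deliberately not RFC 2822 compliant, and every name in it is hypothetical.

```python
import re

# Two small patterns instead of one large one (both are simplifying assumptions).
LOCAL_PART = re.compile(r"[a-z0-9._%+-]+", re.IGNORECASE)
DOMAIN_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", re.IGNORECASE)


def is_plausible_email(address: str) -> bool:
    """Check the local part and each domain label separately."""
    local, sep, domain = address.rpartition("@")
    if not sep or not LOCAL_PART.fullmatch(local):
        return False
    labels = domain.split(".")
    # Require at least "name.tld", with every label well formed.
    return len(labels) >= 2 and all(DOMAIN_LABEL.fullmatch(label) for label in labels)


print(is_plausible_email("curator@example.museum"))  # True
print(is_plausible_email("a@-bad.com"))              # False
```

    Each piece stays short enough to read at a glance, and a rule such as the allowed top-level domain length can change without touching the rest.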

  • 2021-02-01 20:22

    This may sound stupid, but I often lament not being able to do database-style queries using regular expressions, now even more than before, because I enter that kind of search string all the time on search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression"

    For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
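
    Outside the editor, an AND of patterns is usually simpler to express in code than in one regex. A small Python sketch, where the helper name and the sample list of command names are made up for illustration:

```python
import re


def contains_all(text: str, *patterns: str) -> bool:
    """Return True if every pattern matches somewhere in text, in any order."""
    return all(re.search(p, text, re.IGNORECASE) for p in patterns)


# Sample names chosen for the example only.
commands = ["switch-to-buffer-other-window", "kill-buffer", "delete-window", "window-buffer"]
both = [c for c in commands if contains_all(c, "buffer", "window")]
print(both)  # ['switch-to-buffer-other-window', 'window-buffer']
```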
