Regex implementation that can handle machine-generated regexes: *non-backtracking*, O(n)?

暖寄归人 2020-12-24 09:15

Edit 2: For a practical demonstration of why this remains important, look no further than Stack Overflow's own regex-caused outage today (2016-07-2

5 Answers
  • 2020-12-24 09:56

    Where can I find robustly fast Regex implementation?

    You can't.

    Someone has to say it: given the restrictions, the answer to this question is surely that you can't. It's unlikely you will find an implementation matching your constraints.

    By the way, I'm sure you have already tried this, but have you compiled the regex (with the option that outputs to an assembly)? I ask because:

    if you have a complex Regex and millions of short strings to test

  • 2020-12-24 10:04

    If you can handle using unsafe code (and the licensing issue), you could take the implementation from this TRE Windows port.

    You might be able to use this directly with P/Invoke and explicit layout structs for the following:

    typedef int regoff_t;
    typedef struct {
      size_t re_nsub;  /* Number of parenthesized subexpressions. */
      void *value;     /* For internal use only. */
    } regex_t;
    
    typedef struct {
      regoff_t rm_so;
      regoff_t rm_eo;
    } regmatch_t;
    
    
    typedef enum {
      REG_OK = 0,       /* No error. */
      /* POSIX regcomp() return error codes.  (In the order listed in the
         standard.)  */
      REG_NOMATCH,      /* No match. */
      REG_BADPAT,       /* Invalid regexp. */
      REG_ECOLLATE,     /* Unknown collating element. */
      REG_ECTYPE,       /* Unknown character class name. */
      REG_EESCAPE,      /* Trailing backslash. */
      REG_ESUBREG,      /* Invalid back reference. */
      REG_EBRACK,       /* "[]" imbalance */
      REG_EPAREN,       /* "\(\)" or "()" imbalance */
      REG_EBRACE,       /* "\{\}" or "{}" imbalance */
      REG_BADBR,        /* Invalid content of {} */
      REG_ERANGE,       /* Invalid use of range operator */
      REG_ESPACE,       /* Out of memory.  */
      REG_BADRPT        /* Invalid use of repetition operators. */
    } reg_errcode_t;
    

    Then use the exports capable of handling strings with embedded nulls (with wide character support):

    /* Versions with a maximum length argument and therefore the capability to
       handle null characters in the middle of the strings (not in POSIX.2). */
    int regwncomp(regex_t *preg, const wchar_t *regex, size_t len, int cflags);
    
    int regwnexec(const regex_t *preg, const wchar_t *string, size_t len,
          size_t nmatch, regmatch_t pmatch[], int eflags);
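
    As a rough illustration of the memory layout these declarations imply, here is a minimal Python ctypes sketch (Python rather than C#, purely to check field order and sizes). It mirrors only the structs and error codes quoted above. The class name RegErr is made up for this example, and loading the actual DLL and binding regwncomp/regwnexec is left out, since the library name and calling convention depend on the port you build.

```python
import ctypes
from enum import IntEnum

regoff_t = ctypes.c_int  # typedef int regoff_t;


class regex_t(ctypes.Structure):
    # Field order must match the C declaration exactly.
    _fields_ = [
        ("re_nsub", ctypes.c_size_t),  # number of parenthesized subexpressions
        ("value", ctypes.c_void_p),    # opaque pointer, internal to the library
    ]


class regmatch_t(ctypes.Structure):
    _fields_ = [
        ("rm_so", regoff_t),  # start offset of the match
        ("rm_eo", regoff_t),  # end offset of the match
    ]


class RegErr(IntEnum):
    # Same order as reg_errcode_t above, so the values run 0..13.
    REG_OK = 0
    REG_NOMATCH = 1
    REG_BADPAT = 2
    REG_ECOLLATE = 3
    REG_ECTYPE = 4
    REG_EESCAPE = 5
    REG_ESUBREG = 6
    REG_EBRACK = 7
    REG_EPAREN = 8
    REG_EBRACE = 9
    REG_BADBR = 10
    REG_ERANGE = 11
    REG_ESPACE = 12
    REG_BADRPT = 13


# A caller would allocate the pmatch[] array like this before calling regwnexec.
pmatch = (regmatch_t * 4)()
```

    The equivalent C# structs would need the same field order and integer widths; size_t and the pointer are pointer-sized, regoff_t is a 32-bit int on the usual Windows targets.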
    

    Alternatively, wrap it in a C++/CLI layer for easier translation and more flexibility. I would certainly suggest this if you are comfortable with C++/CLI.

  • 2020-12-24 10:05

    A quick comment: just because you can simulate DFA construction by tracking multiple NFA states at once does not mean you avoid the work of the NFA-to-DFA conversion. The difference is that you distribute the effort over the search itself. That is, worst-case performance is unchanged.

  • 2020-12-24 10:10

    Consider how DFAs are created from regular expressions:

    You start with a regular expression. Each operation (concatenation, union, Kleene closure) contributes states and transitions to an NFA. Each state of the resulting DFA represents a set of NFA states, that is, an element of the power set of the NFA's states. The number of NFA states is linear in the size of the regular expression, so the number of DFA states is exponential in the size of the regular expression in the worst case.
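
    A small Python sketch of the subset construction shows the blow-up. The regex family (a|b)*a(a|b){k} and the helper names make_nfa and count_dfa_states are chosen for this example only; the NFA is built by hand rather than parsed from a pattern.

```python
from collections import deque


def make_nfa(k):
    """NFA for (a|b)*a(a|b){k}: states 0..k+1, with k+1 as the accepting state."""
    nfa = {0: {"a": {0, 1}, "b": {0}}}
    for i in range(1, k + 1):
        nfa[i] = {"a": {i + 1}, "b": {i + 1}}
    nfa[k + 1] = {}
    return nfa


def count_dfa_states(nfa, start=0, alphabet="ab"):
    """Subset construction: count the DFA states reachable from {start}."""
    first = frozenset({start})
    seen = {first}
    queue = deque([first])
    while queue:
        subset = queue.popleft()
        for ch in alphabet:
            # The DFA successor is the union of the NFA successors of each member.
            target = frozenset(t for s in subset for t in nfa[s].get(ch, ()))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)


for k in range(5):
    nfa = make_nfa(k)
    print(k, len(nfa), count_dfa_states(nfa))
```

    For this family the NFA has k + 2 states while the reachable DFA has 2^(k+1) states, because the DFA must remember which of the last k + 1 characters were 'a'.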

    So your first constraint,

    have a worst case time-complexity of regex evaluation of O(m*n) where m is the length of the regex, and n the length of the input

    is impossible. The regex needs to be compiled to a DFA with 2^m states in the worst case, which cannot be done in linear time.

    This is the case for all but the simplest regular expressions, the ones so simple that you could just as easily write a quick .contains expression.

  • 2020-12-24 10:16

    First, what you're suggesting is possible, and you certainly know your subject. You also know that the trade-off of not using a backtracking implementation is memory. If you control your environment enough, this is likely a reasonable approach.

    The only thing I will comment on before continuing is that I would encourage you to question the choice of using RegEx. You are clearly more familiar with your specific problem and what you're trying to solve, so only you can answer the question. I don't think ANTLR would be a good alternative; however, a home-brew rules engine (if limited in scope) can be highly tuned to your specific needs. It all depends on your specific problem.

    For those reading this and 'missing the point', here is some background reading:

    • Regular Expression Matching Can Be Simple And Fast

    From the same site, there are a number of implementations linked on this page.

    The gist of the entire discussion of the above article is that the best answer is to use both. To that end, the only widely used implementation I'm aware of is the one used by the TCL language. As I understand it, it was originally written by Henry Spencer and it employs this hybrid approach. There have been a few attempts at porting it to a C library, though I'm not aware of any that are in wide use. Walter Waldo's and Thomas Lackner's are both mentioned and linked here. Also mentioned is the Boost library, though I'm not sure of the implementation. You can also look at the TCL code itself (linked from their site) and work from there.
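
    A minimal Python sketch of one way to combine the two, assuming "both" means NFA simulation plus DFA-style caching of the state sets already seen. It is not taken from the TCL implementation; the class LazyDFA, its max_cached parameter, and the hand-built NFA for (a|b)*a(a|b)(a|b) are all hypothetical.

```python
# Hand-built NFA for (a|b)*a(a|b)(a|b); NFA[state][char] -> set of next states.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"a": {2}, "b": {2}},
    2: {"a": {3}, "b": {3}},
    3: {},
}


class LazyDFA:
    """Builds DFA transitions on demand from an NFA and caches them."""

    def __init__(self, nfa, start, accept, max_cached=1000):
        self.nfa = nfa
        self.accept = accept
        self.start = frozenset({start})
        self.cache = {}  # (state set, char) -> next state set
        self.max_cached = max_cached

    def step(self, subset, ch):
        key = (subset, ch)
        nxt = self.cache.get(key)
        if nxt is None:
            # Cache miss: fall back to one step of NFA simulation.
            nxt = frozenset(t for s in subset for t in self.nfa[s].get(ch, ()))
            if len(self.cache) >= self.max_cached:
                self.cache.clear()  # crude memory bound: drop the cache and rebuild
            self.cache[key] = nxt
        return nxt

    def match(self, text):
        subset = self.start
        for ch in text:
            subset = self.step(subset, ch)
            if not subset:  # dead state: no NFA state is live
                return False
        return self.accept in subset


matcher = LazyDFA(NFA, start=0, accept=3)
print(matcher.match("bbabb"), len(matcher.cache))
```

    Cached transitions cost one dictionary lookup per character, like a DFA, while the cache limit keeps memory bounded at the price of recomputing transitions after a flush.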

    In short, I'd go with TRE or Plan 9 as these are both actively supported.

    Obviously, none of these are C#/.NET, and I'm not aware of one that is.
