A regular expression generator for number ranges

长情又很酷 2021-02-04 05:56

I checked the Stack Exchange description, and algorithm questions are one of the allowed topics. So here goes.

Given an input of a range, where begin and ending number

9 Answers
  • 2021-02-04 06:04

    Here's my solution and an algorithm with complexity O(log n) (n is the end of the range). I believe it is the simplest one here:

    Basically, split your task into these steps:

    1. Gradually "weaken" the start of the range.
    2. Gradually "weaken" the end of the range.
    3. Merge those two.

    By "weaken", I mean finding the far end of the range that starts at this specific number and can still be represented by a single simple regex, for example:

    145 -> 149,150 -> 199,200 -> 999,1000 -> etc.
    

    Here's a backward one, for the end of the range:

    387 -> 380,379 -> 300,299 -> 0
    

    Merging would be the process of noticing the overlap of 299->0 and 200->999 and combining those into 200->299.

    As a result, you would get this set of numbers (first list intact, second one inverted):

    145, 149, 150, 199, 200, 299, 300, 379, 380, 387
    

    Now, here is the fun part. Take the numbers in pairs, and convert them to ranges:

    145-149, 150-199, 200-299, 300-379, 380-387
    

    Or in regex:

    14[5-9], 1[5-9][0-9], 2[0-9][0-9], 3[0-7][0-9], 38[0-7]
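
    The pair-to-regex step can be sketched in Java as below. This is only an illustration, not the linked implementation: the names PairToPattern and toPattern are made up here, and the sketch assumes both numbers of a pair have the same digit count and the shape produced above (after the first differing digit, the lower bound has only 0s and the upper bound only 9s).

```java
import java.util.ArrayList;
import java.util.List;

final class PairToPattern {
    // Builds one digit pattern for the pair (lo, hi).
    // Assumption: same digit count, and the pair comes from the weakening
    // steps, so a per-position class [x-y] describes the range exactly.
    static String toPattern(int lo, int hi) {
        String a = String.valueOf(lo);
        String b = String.valueOf(hi);
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("digit counts differ");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == y) {
                sb.append(x);                       // identical digit: literal
            } else {
                sb.append('[').append(x).append('-').append(y).append(']');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] bounds = {145, 149, 150, 199, 200, 299, 300, 379, 380, 387};
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < bounds.length; i += 2) {
            parts.add(toPattern(bounds[i], bounds[i + 1]));
        }
        // Join the alternatives into one expression.
        System.out.println(String.join("|", parts));
    }
}
```

    Joining the parts with | and wrapping them in a group with anchors would give a complete expression for the whole range.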
    

    Here's what the code for the weakening looks like:

    public static int next(int num) {
        //Convert to String for easier operations
        final char[] chars = String.valueOf(num).toCharArray();
        //Go through all digits backwards
        for (int i=chars.length-1; i>=0;i--) {
            //Skip the 0 changing it to 9. For example, for 190->199
            if (chars[i]=='0') {
                chars[i] = '9';
            } else { //If any other digit is encountered, change that to 9, for example, 195->199, or with both rules: 150->199
                chars[i] = '9';
                break;
            }
        }
    
        return Integer.parseInt(String.valueOf(chars));
    }
    
    //Same thing, but reversed. 387 -> 380, 379 -> 300, etc
    public static int prev(int num) {
        final char[] chars = String.valueOf(num).toCharArray();
        for (int i=chars.length-1; i>=0;i--) {
            if (chars[i] == '9') {
                chars[i] = '0';
            } else {
                chars[i] = '0';
                break;
            }
        }
    
        return Integer.parseInt(String.valueOf(chars));
    }
    

    The rest is technical details and is easy to implement. Here's an implementation of this O(log n) algorithm: https://ideone.com/3SCvZf
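
    One possible way to fill in those details is sketched below. It is my reading of steps 1 to 3, not the code behind the link: RangeSplitter, split and describe are hypothetical names, and next/prev are rewritten arithmetically but are meant to return the same values as the string-based versions above for non-negative ints.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class RangeSplitter {
    // Trailing 0s and the first non-zero digit become 9 (145 -> 149, 150 -> 199).
    static int next(int n) {
        long p = 10;
        while (n % p == 0 && p <= n) p *= 10;
        return (int) (n - n % p + p - 1);
    }

    // Trailing 9s and the first non-9 digit become 0 (387 -> 380, 299 -> 0).
    static int prev(int n) {
        long p = 10;
        while ((n + 1) % p == 0 && p <= n) p *= 10;
        return (int) (n - n % p);
    }

    // Splits [start, end] into sub-ranges, each expressible as one digit pattern.
    static List<int[]> split(int start, int end) {
        List<int[]> head = new ArrayList<>();
        List<int[]> tail = new ArrayList<>();
        int lo = start;
        int hi = end;
        // Step 1: weaken the start while the weakened value stays in range.
        while (lo <= hi && next(lo) <= hi) {
            head.add(new int[] {lo, next(lo)});
            lo = next(lo) + 1;
        }
        // Step 2: weaken the end the same way, collecting in ascending order.
        while (lo <= hi && prev(hi) >= lo) {
            tail.add(0, new int[] {prev(hi), hi});
            hi = prev(hi) - 1;
        }
        // Step 3: whatever is left in the middle is the merged overlap.
        if (lo <= hi) head.add(new int[] {lo, hi});
        head.addAll(tail);
        return head;
    }

    static String describe(int start, int end) {
        return split(start, end).stream()
                .map(r -> r[0] + "-" + r[1])
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(describe(145, 387));
    }
}
```

    Each resulting pair can then be turned into a pattern digit by digit, as in the 145-387 example above.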

    Oh, and by the way, it works with other ranges too, for example for range 1-321654 the result is:

    [1-9]
    [1-9][0-9]
    [1-9][0-9][0-9]
    [1-9][0-9][0-9][0-9]
    [1-9][0-9][0-9][0-9][0-9]
    [1-2][0-9][0-9][0-9][0-9][0-9]
    3[0-1][0-9][0-9][0-9][0-9]
    320[0-9][0-9][0-9]
    321[0-5][0-9][0-9]
    3216[0-4][0-9]
    32165[0-4]
    

    And for 129-131 it's:

    129
    13[0-1]
    
  • 2021-02-04 06:05

    One option would be to (for a range [n, m]) generate the regexp n|n+1|...|m-1|m. However, I think you're after something more optimised. You can still do essentially the same thing: generate an FSM that matches each number using a distinct path through the state machine, use any of the well-known FSM minimisation algorithms to produce a smaller machine, then turn that into a more condensed regular expression (since "regular expressions" without the Perl extensions are equivalent in expressive power to finite state machines).

    Let's say we are looking at the range [107, 112]:

    state1:
      1 -> state2
      * -> NotOK
    state2:
      0 -> state2.0
      1 -> state2.1
      * -> NotOK
    state2.0:
      7 -> OK
      8 -> OK
      9 -> OK
      * -> NotOK
    state2.1:
      0 -> OK
      1 -> OK
      2 -> OK
      * -> NotOK
    

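    We can't really reduce this machine any further. We can see that state2.0 corresponds to the RE [789] and state2.1 corresponds to [012]. We can then see that state2 is (0[789])|(1[012]) and the whole is 1((0[789])|(1[012])). Note the outer parentheses: without them the alternation would split the whole expression and the second branch would lose its leading 1.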

    Further reading on DFA minimization can be found on Wikipedia (and pages linked from there).
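
    A quick brute-force check of such a hand-derived expression can be sketched in Java. RangeRegexCheck and matchesExactly are names invented for this illustration; the expression tested is the one for [107, 112] with the alternation grouped under the leading 1, and the upper limit of 9999 is an arbitrary choice.

```java
import java.util.regex.Pattern;

final class RangeRegexCheck {
    // Returns true if, among the decimal strings of 0..limit, the regex
    // fully matches exactly those numbers that lie in [lo, hi].
    static boolean matchesExactly(String regex, int lo, int hi, int limit) {
        Pattern p = Pattern.compile(regex);
        for (int n = 0; n <= limit; n++) {
            boolean expected = n >= lo && n <= hi;
            boolean actual = p.matcher(String.valueOf(n)).matches();
            if (actual != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("1((0[789])|(1[012]))", 107, 112, 9999));
    }
}
```

    The same helper can be pointed at any generated expression to catch grouping or off-by-one mistakes before the regex is used elsewhere.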

  • 2021-02-04 06:08

    You cannot cover your requirement with character classes alone. Imagine the range 129-131: the pattern 1[2-3][1-9] would also match 139, which is out of range (and it misses 130).

    So in this example you need to change the last group to something that depends on the preceding digit: 9 after 12, but 0 or 1 after 13. You will find the same effect for the tens and hundreds, which leads to the conclusion that a pattern representing each valid number as a fixed sequence of digits is the only simple working solution (unless you want an algorithm that tracks overflows in order to decide whether to use [2-8] or (8|9|0|1|2)).

    If you autogenerate the pattern anyway, keep it simple:

    128-132
    

    can be written as follows (I left out the non-capturing group prefix ?: for better readability):

    (128|129|130|131|132)
    

    The algorithm should be obvious: a for loop, an array, string concatenation and a join.
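
    A minimal Java sketch of that loop-and-join idea, with a stream standing in for the explicit for loop and array. The names NaiveRangeRegex and alternation are made up for this example.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class NaiveRangeRegex {
    // Lists every number of [start, end] as its own alternative.
    static String alternation(int start, int end) {
        return IntStream.rangeClosed(start, end)
                .mapToObj(Integer::toString)
                .collect(Collectors.joining("|", "(", ")"));
    }

    public static void main(String[] args) {
        System.out.println(alternation(128, 132));
    }
}
```

    The pattern grows linearly with the size of the range, which is fine for small ranges and is the reason for the compaction steps that follow.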

    That would already work as expected, but you can also perform some "optimization" on this if you like it more compact:

    (128|129|130|131|132) <=>
    1(28|29|30|31|32) <=>
    1(2(8|9)|3(0|1|2))
    

    Further optimization:

    1(2([8-9])|3([0-2]))
    

    Algorithms for the last steps are out there, look for factorization. An easy way would be to push all the numbers to a tree, depending on the character position:

    1
      2
        8
        9
      3
        0
        1
        2
    

    and finally iterate over the tree and form the pattern 1(2(8|9)|3(0|1|2)). As a last step, replace anything of the pattern (a|(b|)*?c) with [a-c]
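
    The tree approach can be sketched in Java roughly as follows. DigitTrie, add, toPattern and forRange are hypothetical names, and the sketch assumes start and end have the same number of digits (otherwise a shorter number that is a prefix of a longer one would be lost). The final rewrite of (a|b|c) into [a-c] is left out.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class DigitTrie {
    // Children keyed by digit; TreeMap keeps them in ascending order.
    private final Map<Character, DigitTrie> children = new TreeMap<>();

    // Inserts one number, one tree level per character position.
    void add(String digits) {
        DigitTrie node = this;
        for (char c : digits.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new DigitTrie());
        }
    }

    // Emits the pattern for this subtree. Several children are wrapped in a
    // group and separated by |; a single child needs no parentheses.
    String toPattern() {
        if (children.isEmpty()) {
            return "";
        }
        String body = children.entrySet().stream()
                .map(e -> e.getKey() + e.getValue().toPattern())
                .collect(Collectors.joining("|"));
        return children.size() == 1 ? body : "(" + body + ")";
    }

    // Assumption: start and end have the same digit count.
    static String forRange(int start, int end) {
        DigitTrie root = new DigitTrie();
        for (int n = start; n <= end; n++) {
            root.add(String.valueOf(n));
        }
        return root.toPattern();
    }

    public static void main(String[] args) {
        System.out.println(forRange(128, 132));
    }
}
```

    Because every number of the range is inserted, memory and time grow with the size of the range, so this fits small ranges best.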

    Same goes for 11-29:

    11-29 <=>
    (11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29) <=>
    (1(1|2|3|4|5|6|7|8|9)|2(0|1|2|3|4|5|6|7|8|9)) <=>
    (1([1-9])|2([0-9]))
    

    When two branches share the same tail, you can also factor it out. That is not the case here ([1-9] versus [0-9], because 10 is out of range but 20 is in), but it is for 10-29:

    (1([0-9])|2([0-9])) <=>
    (1|2)[0-9] <=>
    [1-2][0-9]
    
    
  • 2021-02-04 06:09

    [Hint: somehow the idea of applying recursion as presented in my first answer (using Python) did not reach the OP, so here it is again in Java. Note that for a recursive solution it is often easier to prove correctness.]

    The key observation to use recursion is that ranges starting with a number ending in 0 and ending with a number ending in 9 are covered by digit patterns that all end in [0-9].

    20-239 is covered by [2-9][0-9], 1[0-9][0-9], 2[0-3][0-9]
    

    When taking off the last digit of start and end of the range the resulting range is covered by the same digit patterns, except for the missing trailing [0-9]:

    20-239 is covered by [2-9][0-9], 1[0-9][0-9], 2[0-3][0-9]
    2 -23  is covered by [2-9],      1[0-9],      2[0-3]
    

    So when we are looking for the digit patterns that cover a range (e.g. 13-247), we split off a range before the first number ending in 0 and a range after the last number ending in 9 (note that these split off ranges can be empty), e.g.

    13-247 = 13-19, 20-239, 240-247
    20-247 =        20-239, 240-247
    13-239 = 13-19, 20-239
    20-239 =        20-239
    

    The remaining range is handled recursively by taking off the last digits and appending [0-9] to all digit patterns of the reduced range.

    When generating pairs start,end for the subranges that can be covered by one digit pattern (as done by bezmax and the OP), the subranges of the reduced range have to be "blown up" correspondingly.

    The special cases when there is no number ending in 0 in the range or when there is no number ending in 9 in the range can only happen if start and end only differ at the last digit; in this case the whole range can be covered by one digit pattern.

    So here is an alternative implementation of getRegexPairs based on this recursion principle:

    private static List<Integer> getRegexPairs(int start, int end)
    {
      List<Integer> pairs = new ArrayList<>();   
      if (start > end) return pairs; // empty range
      int firstEndingWith0 = 10*((start+9)/10); // first number ending with 0
      if (firstEndingWith0 > end) // not in range?
      {
        // start and end differ only at last digit
        pairs.add(start);
        pairs.add(end);
        return pairs;
      }
    
      if (start < firstEndingWith0) // start is not ending in 0
      {
        pairs.add(start);
        pairs.add(firstEndingWith0-1);
      }
    
      // last number ending with 9 that is <= end (can be below firstEndingWith0)
      int lastEndingWith9 = 10*((end+1)/10)-1;
      if (firstEndingWith0 <= lastEndingWith9) // middle part is not empty
      {
        // all regex for the range [firstEndingWith0,lastEndingWith9] end with [0-9]
        List<Integer> pairsMiddle = getRegexPairs(firstEndingWith0/10, lastEndingWith9/10);
        for (int i=0; i<pairsMiddle.size(); i+=2)
        {
          // blow up each pair by adding all possibilities for appended digit
          pairs.add(pairsMiddle.get(i)  *10+0);
          pairs.add(pairsMiddle.get(i+1)*10+9);
        }
      }
    
      if (lastEndingWith9 < end) // end is not ending in 9
      {
        pairs.add(lastEndingWith9+1);
        pairs.add(end);
      }
    
      return pairs;
    }
    
  • 2021-02-04 06:09

    If you need a regex pattern for the range 5 to 300 that also accepts decimal values (up to two fractional digits), here is my answer:

    ^0*(([5-9]([.][0-9]{1,2})?)|[1-9][0-9]{1}?([.][0-9]{1,2})?|[12][0-9][0-9]([.][0-9]{1,2})?|300([.]0{1,2})?)$
    

    For the range 1 to 300:

    ^0*([1-9][0-9]?([.][0-9]{1,2})?|[12][0-9][0-9]([.][0-9]{1,2})?|300([.]0{1,2})?)$
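
    To see what the 5 to 300 pattern accepts, it can be tried against a few inputs in Java. The class name DecimalRangeDemo, the helper inRange and the sample values are chosen only for this illustration; the pattern string is copied unchanged from above.

```java
import java.util.regex.Pattern;

final class DecimalRangeDemo {
    // The 5-300 pattern from above, copied verbatim.
    static final Pattern FIVE_TO_300 = Pattern.compile(
            "^0*(([5-9]([.][0-9]{1,2})?)|[1-9][0-9]{1}?([.][0-9]{1,2})?"
            + "|[12][0-9][0-9]([.][0-9]{1,2})?|300([.]0{1,2})?)$");

    static boolean inRange(String s) {
        return FIVE_TO_300.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"5", "4.99", "007.5", "299.99", "300.00", "300.01", "12.345"};
        for (String s : samples) {
            System.out.println(s + " -> " + inRange(s));
        }
    }
}
```

    Note that leading zeros are accepted and at most two fractional digits are allowed, so a value such as 12.345 is rejected even though it lies numerically inside the range.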
    
  • 2021-02-04 06:09

    Bezmax's answer is close but doesn't quite solve the problem correctly; I believe a few details are wrong. I have fixed the issues and written the algorithm in C++. The main problem in Bezmax's algorithm is as follows:

    The prev function should produce the following: 387 -> 380, 379 -> 300, 299 -> 100, 99 -> 10, 9 -> 0, whereas Bezmax had: 387 -> 380, 379 -> 300, 299 -> 0.

    Bezmax had 299 "weakening" to 0; this could leave part of the range out in certain circumstances. Basically, you want to weaken to the lowest number you can but never change the number of digits. The full solution is too much code to post here, but here are the important parts. Hope this helps someone.

        #include <cassert>
        #include <cstdlib>   // std::div
        #include <utility>   // std::pair
        #include <vector>
    
        // Find the next number that is advantageous for regular expressions.
        //
        // Starting at the right most decimal digit convert all zeros to nines. Upon
        // encountering the first non-zero convert it to a nine and stop. The output
        // always has the same number of digits as the input.
        // examples: 100->999, 0->9, 5->9, 9->9, 14->19, 120->199, 10010->10099
        static int Next(int val)
        {
           assert(val >= 0);
    
           // keep track of how many nines to add to val.
           int addNines = 0;
    
           do {
              auto res = std::div(val, 10);
              val = res.quot;
              ++addNines;
              if (res.rem != 0) {
                 break;
              }
           } while (val != 0);
    
           // add the nines
           for (int i = 0; i < addNines; ++i) {
              val = val * 10 + 9;
           }
    
           return val;
        }
    
        // Find the previous number that is advantageous for regular expressions.
        //
        // If the number is a single digit number convert it to zero and stop. Else...
        // Starting at the right most decimal digit convert all trailing 9's to 0's
        // unless the digit is the most significant digit - change that 9 to a 1. Upon
        // encounter with first non-nine digit convert it to a zero (or 1 if most
        // significant digit) and stop. The output always has the same number of digits
        // as the input.
        // examples: 0->0, 1->0, 29->10, 999->100, 10199->10000, 10->10, 399->100
        static int Prev(int val)
        {
           assert(val >= 0);
    
           // special case all single digit numbers reduce to 0
           if (val < 10) {
              return 0;
           }
    
           // keep track of how many zeros to add to val.
           int addZeros = 0;
    
           for (;;) {
              auto res = std::div(val, 10);
              val = res.quot;
              ++addZeros;
              if (res.rem != 9) {
                 break;
              }
    
              if (val < 10) {
                 val = 1;
                 break;
              }
           }
    
           // add the zeros
           for (int i = 0; i < addZeros; ++i) {
              val *= 10;
           }
    
           return val;
        }
    
        // Create a vector of ranges that covers [start, end] that is advantageous for
        // regular expression creation. Must satisfy end>=start>=0.
        static std::vector<std::pair<int, int>> MakeRegexRangeVector(const int start,
                                                                     const int end)
        {
           assert(start <= end);
           assert(start >= 0);
    
           // keep track of the remaining portion of the range not yet placed into
           // the forward and reverse vectors.
           int remainingStart = start;
           int remainingEnd = end;
    
           std::vector<std::pair<int, int>> forward;
           while (remainingStart <= remainingEnd) {
              auto nextNum = Next(remainingStart);
              // is the next number within the range still needed.
              if (nextNum <= remainingEnd) {
                 forward.emplace_back(remainingStart, nextNum);
                 // increase remainingStart as portions of the numeric range are
                 // transferred to the forward vector.
                 remainingStart = nextNum + 1;
              } else {
                 break;
              }
           }
           std::vector<std::pair<int, int>> reverse;
           while (remainingEnd >= remainingStart) {
              auto prevNum = Prev(remainingEnd);
              // is the previous number within the range still needed.
              if (prevNum >= remainingStart) {
                 reverse.emplace_back(prevNum, remainingEnd);
                 // reduce remainingEnd as portions of the numeric range are transferred
                 // to the reverse vector.
                 remainingEnd = prevNum - 1;
              } else {
                 break;
              }
           }
    
           // is there any part of the range not accounted for in the forward and
           // reverse vectors?
           if (remainingStart <= remainingEnd) {
              // add the unaccounted for part - this is guaranteed to be expressible
              // as a single regex substring.
              forward.emplace_back(remainingStart, remainingEnd);
           }
    
           // Concatenate, in reverse order, the reverse vector to forward.
           forward.insert(forward.end(), reverse.rbegin(), reverse.rend());
    
           // Some sanity checks.
           // size must be non zero.
           assert(forward.size() > 0);
    
           // verify starting and ending points of the range
           assert(forward.front().first == start);
           assert(forward.back().second == end);
    
           return forward;
        }
    