Simplification Algorithm for Reverse Polish Notation

问题

A couple of days ago I played around with Befunge which is an esoteric programming language. Befunge uses a LIFO stack to store data. When you write programs the digits from 0 to 9 are actually Befunge-instructions which push the corresponding values onto the stack. So for exmaple this would push a 7 to stack:

34+

In order to push a number greater than 9, calculations must be done with numbers less than or equal to 9. This would yield 123.

99*76*+

While solving Euler Problem 1 with Befunge I had to push the fairly large number 999 to the stack. Here I began to wonder how I could accomplish this task with as few instructions as possible. By writing a term down in infix notation and taking out common factors I came up with

9993+*3+*

One could also simply multiply two two-digit numbers which produce 999, e.g.

39*66*1+*

I thought about this for while and then decided to write a program which puts out the smallest expression according to these rules in reverse polish notation for any given integer. This is what I have so far (written in NodeJS with underscorejs):

var makeExpr = function (value) {
    if (value < 10) return value + "";
    var output = "", counter = 0;
    (function fn (val) {
        counter++;
        if(val < 9) { output  += val; return; };
        var exp = Math.floor(Math.log(val) / Math.log(9));
        var div = Math.floor(val / Math.pow(9, exp));
        _( exp ).times(function () { output += "9"; });
        _(exp-1).times(function () { output += "*"; });
        if (div > 1) output += div + "*";
        fn(val - Math.pow(9, exp) * div);    
    })(value);
    _(counter-1).times(function () { output+= "+"; });
    return output.replace(/0\+/, "");
};

makeExpr(999);
// yields 999**99*3*93*++

This piece of code constructs the expression naively and is obvously way to long. Now my questions:

Is there an algorithm to simplify expressions in reverse polish notation?
Would simplification be easier in infix notation?
Can an expression like 9993+*3+* be proofed to be the smallest one possible?

I hope you can give some insights. Thanks in advance.

回答1:

There's also 93*94*1+*, which is basically 27*37.

Were I to attack this problem, I'd start by first trying to evenly divide the number. So given 999 I would divide by 9 and get 111. Then I'd try to divide by 9, 8, 7, etc. until I discovered that 111 is 3*37.

37 is prime, so I go greedy and divide by 9, giving me 4 with a remainder of 1.

That seems to give me optimum results for the half dozen I've tried. It's a little expensive, of course, testing for even divisibility. But perhaps not more expensive than generating a too-long expression.

Using this, 100 becomes 55*4*. 102 works out to 29*5*6+.

101 brings up an interesting case. 101/9 = (9*11) + 2. Or, alternately, (9*9)+20. Let's see:

983+*2+  (9*11) + 2
99*45*+  (9*9) + 20

Whether it's easier to generate the postfix directly or generate infix and convert, I really don't know. I can see benefits and drawbacks to each.

Anyway, that's the approach I'd take: try to divide evenly at first, and then be greedy dividing by 9. Not sure exactly how I'd structure it.

I'd sure like to see your solution once you figure it out.

Edit

This is an interesting problem. I came up with a recursive function that does a credible job of generating postfix expressions, but it's not optimum. Here it is in C#.

string GetExpression(int val)
{
    if (val < 10)
    {
        return val.ToString();
    }
    int quo, rem;
    // first see if it's evenly divisible
    for (int i = 9; i > 1; --i)
    {
        quo = Math.DivRem(val, i, out rem);
        if (rem == 0)
        {
            // If val < 90, then only generate here if the quotient
            // is a one-digit number. Otherwise it can be expressed
            // as (9 * x) + y, where x and y are one-digit numbers.
            if (val >= 90 || (val < 90 && quo <= 9))
            {
                // value is (i * quo)
                return i + GetExpression(quo) + "*";
            }
        }
    }

    quo = Math.DivRem(val, 9, out rem);
    // value is (9 * quo) + rem
    // optimization reduces (9 * 1) to 9
    var s1 = "9" + ((quo == 1) ? string.Empty : GetExpression(quo) + "*");
    var s2 = GetExpression(rem) + "+";
    return s1 + s2;
}

For 999 it generates 9394*1+**, which I believe is optimum.

This generates optimum expressions for values <= 90. Every number from 0 to 90 can be expressed as the product of two one-digit numbers, or by an expression of the form (9x + y), where x and y are one-digit numbers. However, I don't know that this guarantees an optimum expression for values greater than 90.

回答2:

When only considering multiplication and addition, it's pretty easy to construct optimal formula's, because that problem has the optimal substructure property. That is, the optimal way to build [num1][num2]op is from num1 and num2 that are both also optimal. If duplication is also considered, that's no longer true.

The num1 and num2 give rise to overlapping subproblems, so Dynamic Programming is applicable.

We can simply, for a number i:

For every 1 < j <= sqrt(i) that evenly divides i, try [j][i / j]*
For every 0 < j < i/2, try [j][i - j]+
Take the best found formula

That is of course very easy to do bottom-up, just start at i = 0 and work your way up to whatever number you want. Step 2 is a little slow, unfortunately, so after say 100000 it starts to get annoying to wait for it. There might be some trick that I'm not seeing.

Code in C# (not tested super well, but it seems to work):

string[] n = new string[10000];
for (int i = 0; i < 10; i++)
    n[i] = "" + i;
for (int i = 10; i < n.Length; i++)
{
    int bestlen = int.MaxValue;
    string best = null;
    // try factors
    int sqrt = (int)Math.Sqrt(i);
    for (int j = 2; j <= sqrt; j++)
    {
        if (i % j == 0)
        {
            int len = n[j].Length + n[i / j].Length + 1;
            if (len < bestlen)
            {
                bestlen = len;
                best = n[j] + n[i / j] + "*";
            }
        }
    }
    // try sums
    for (int j = 1; j < i / 2; j++)
    {
        int len = n[j].Length + n[i - j].Length + 1;
        if (len < bestlen)
        {
            bestlen = len;
            best = n[j] + n[i - j] + "+";
        }
    }
    n[i] = best;
}

Here's a trick to optimize searching for the sums. Suppose there is an array that contains, for every length, the highest number that can be made with that length. An other thing that is perhaps less obvious that this array also gives us, is a quick way to determine the shortest number that is bigger than some threshold (by simply scanning through the array and noting the first position that crosses the threshold). Together, that gives a quick way to discard huge portions of the search space.

For example, the biggest number of length 3 is 81 and the biggest number of length 5 is 728. Now if we want to know how to get 1009 (prime, so no factors found), first we try the sums where the first part has length 1 (so 1+1008 through 9+1000), finding 9+1000 which is 9 characters long (95558***+).

The next step, checking the sums where the first part has length 3 or less, can be skipped completely. 1009 - 81 = 929, and 929 (the lowest that the second part of the sum can be if the first part is to be 3 characters or less) is bigger than 728 so numbers of 929 and over must be at least 7 characters long. So if the first part of the sum is 3 characters, the second part must be at least 7 characters, and then there's also a + sign on the end, so the total is at least 11 characters. The best so far was 9, so this step can be skipped.

The next step, with 5 characters in the first part, can also be skipped, because 1009 - 728 = 280, and to make 280 or high we need at least 5 characters. 5 + 5 + 1 = 11, bigger than 9, so don't check.

Instead of checking about 500 sums, we only had to check 9 this way, and the check to make the skipping possible is very quick. This trick is good enough that generating all numbers up to a million only takes 3 seconds on my PC (before, it would take 3 seconds to get to 100000).

Here's the code:

string[] n = new string[100000];
int[] biggest_number_of_length = new int[n.Length];
for (int i = 0; i < 10; i++)
    n[i] = "" + i;
biggest_number_of_length[1] = 9;
for (int i = 10; i < n.Length; i++)
{
    int bestlen = int.MaxValue;
    string best = null;
    // try factors
    int sqrt = (int)Math.Sqrt(i);
    for (int j = 2; j <= sqrt; j++)
    {
        if (i % j == 0)
        {
            int len = n[j].Length + n[i / j].Length + 1;
            if (len < bestlen)
            {
                bestlen = len;
                best = n[j] + n[i / j] + "*";
            }
        }
    }
    // try sums
    for (int x = 1; x < bestlen; x += 2)
    {
        int find = i - biggest_number_of_length[x];
        int min = int.MaxValue;
        // find the shortest number that is >= (i - biggest_number_of_length[x])
        for (int k = 1; k < biggest_number_of_length.Length; k += 2)
        {
            if (biggest_number_of_length[k] >= find)
            {
                min = k;
                break;
            }
        }
        // if that number wasn't small enough, it's not worth looking in that range
        if (min + x + 1 < bestlen)
        {
            // range [find .. i] isn't optimal
            for (int j = find; j < i; j++)
            {
                int len = n[i - j].Length + n[j].Length + 1;
                if (len < bestlen)
                {
                    bestlen = len;
                    best = n[i - j] + n[j] + "+";
                }
            }
        }
    }
    // found
    n[i] = best;
    biggest_number_of_length[bestlen] = i;
}

There's still room for improvement. This code will re-check sums that it has already checked. There are simple ways to make it at least not check the same sum twice (by remembering the last find), but that made no significant difference in my tests. It should be possible to find a better upper bound.

回答3:

There is 44 solutions for 999 with lenght 9:

39149*+**
39166*+**
39257*+**
39548*+**
39756*+**
39947*+**
39499**+*
39669**+*
39949**+*
39966**+*
93149*+**
93166*+**
93257*+**
93548*+**
93756*+**
93947*+**
93269**+*
93349**+*
93366**+*
93439**+*
93629**+*
93636**+*
93926**+*
93934**+*
93939+*+*
93948+*+*
93957+*+*
96357**+*
96537**+*
96735**+*
96769+*+*
96778+*+*
97849+*+*
97858+*+*
97867+*+*
99689+*+*
956*99*+*
968*79*+*
39*149*+*
39*166*+*
39*257*+*
39*548*+*
39*756*+*
39*947*+*

Edit:

I have working on some search space pruning improvements so sorry I have not posted it immediately. There is script in Erlnag. Original one takes 14s for 999 but this one makes it in around 190ms.

Edit2:

There is 1074 solutions of length 13 for 9999. It takes 7 minutes and there is some of them below:

329+9677**+**
329+9767**+**
338+9677**+**
338+9767**+**
347+9677**+**
347+9767**+**
356+9677**+**
356+9767**+**
3147789+***+*
31489+77***+*
3174789+***+*
3177489+***+*
3177488*+**+*

There is version in C with more aggressive pruning of state space and returns only one solution. It is way faster.

$ time ./polish_numbers 999
Result for 999: 39149*+**, length 9

real    0m0.008s
user    0m0.004s
sys     0m0.000s

$ time ./polish_numbers 99999
Result for 99999: 9158*+1569**+**, length 15

real    0m34.289s
user    0m34.296s
sys     0m0.000s

harold was reporting his C# bruteforce version makes same number in 20s so I was curious if I can improve mine. I have tried better memory utilization by refactoring data structure. Searching algorithm mostly works with length of solution and it's existence so I separated this information to one structure (best_rec_header). I have also make solution as tree branches separated in another (best_rec_args). Those data are used only when new better solution for given number. There is code.

Result for 99999: 9158*+1569**+**, length 15

real    0m31.824s
user    0m31.812s
sys     0m0.012s

It was still too much slow. So I tried some other versions. First I added some statistics to demonstrate that mine code is not computing all smaller numbers.

Result for 99999: 9158*+1569**+**, length 15, (skipped 36777, computed 26350)

Then I have tried change code to compute + solutions for bigger numbers first.

Result for 99999: 1956**+9158*+**, length 15, (skipped 0, computed 34577)

real    0m17.055s
user    0m17.052s
sys     0m0.008s

It was almost as twice faster. But there was another idea that may be sometimes I give up find solution for some number as limited by current best_len limit. So I tried to make small numbers (up to half of n) unlimited (note 255 as best_len limit for first of operands finding).

Result for 99999: 9158*+1569**+**, length 15, (skipped 36777, computed 50000)

real    0m12.058s
user    0m12.048s
sys     0m0.008s

Nice improvement but what if I limit solutions for those numbers by best solution found so far. It needs some sort of computation global state. Code becomes more complicated but result even faster.

Result for 99999: 97484777**+**+*, length 15, (skipped 36997, computed 33911)

real    0m10.401s
user    0m10.400s
sys     0m0.000s

It was even able to compute ten times bigger number.

Result for 999999: 37967+2599**+****, length 17, (skipped 440855)

real    12m55.085s
user    12m55.168s
sys     0m0.028s

Then I decided to try also brute force method and this was even faster.

Result for 99999: 9158*+1569**+**, length 15

real    0m3.543s
user    0m3.540s
sys     0m0.000s

Result for 999999: 37949+2599**+****, length 17

real    5m51.624s
user    5m51.556s
sys     0m0.068s

Which shows, that constant matter. It is especially true for modern CPU when brute force approach gets advantage from better vectorization, better CPU cache utilization and less branching.

Anyway, I think there is some better approach using better understanding of number theory or space searching by algorithms as A* and so. And for really big numbers there may be good idea to use genetic algorithms.

Edit3:

harold came with new idea to eliminate trying to much sums. I have implemented it in this new version. It is order of magnitude faster.

$ time ./polish_numbers 99999
Result for 99999: 9158*+1569**+**, length 15

real    0m0.153s
user    0m0.152s
sys     0m0.000s
$ time ./polish_numbers 999999
Result for 999999: 37949+2599**+****, length 17

real    0m3.516s
user    0m3.512s
sys     0m0.004s
$ time ./polish_numbers 9999999
Result for 9999999: 9788995688***+***+*, length 19

real    1m39.903s
user    1m39.904s
sys     0m0.032s

回答4:

Don't forget, you can also push ASCII values!! Usually, this is longer, but for higher numbers it can get much shorter:

If you needed the number 123, it would be much better to do "{" than 99*76*+

来源：https://stackoverflow.com/questions/20153412/simplification-algorithm-for-reverse-polish-notation

标签

algorithm

algebra

simplification

postfix-notation