Why is creating a range from 0 to log(len(list), 2) so slow?

问题

I don't have a clue why is this happening. I was messing with some lists, and I needed a for loop going from 0 to log(n, 2) where n was the length of a list. But the code was amazingly slow, so after a bit a research I found that the problem is in the range generation. Sample code for demonstration:

n = len([1,2,3,4,5,6,7,8])
k = 8
timeit('range(log(n, 2))', number=2, repeat=3) # Test 1
timeit('range(log(k, 2))', number=2, repeat=3) # Test 2

The output

2 loops, best of 3: 2.2 s per loop
2 loops, best of 3: 3.46 µs per loop

The number of tests is low (I didn't want this to be running more than 10 minutes), but it already shows that range(log(n, 2)) is orders of magnitude slower than the counterpart using just the logarithm of an integer. This is really surprising and I don't have any clue on why is this happening. Maybe is a problem on my PC, maybe a Sage problem or a Python bug (I didn't try the same on Python).

Using xrange instead of range doesn't help either. Also, if you get the number with .n(), test 1 runs at the same speed of 2.

Does anybody know what can be happening? Thanks!

回答1:

Good grief -- I recognize this one. It's related to one of mine, trac #12121. First, you get extra overhead from using a Python int as opposed to a Sage Integer for boring reasons:

sage: log(8, 2)
3
sage: type(log(8, 2))
sage.rings.integer.Integer
sage: log(8r, 2)
log(8)/log(2)
sage: type(log(8r, 2))
sage.symbolic.expression.Expression
sage: %timeit log(8, 2)
1000000 loops, best of 3: 1.4 us per loop
sage: %timeit log(8r, 2)
1000 loops, best of 3: 404 us per loop

(The r suffix means "raw", and prevents the Sage preparser from wrapping the literal 2 into Integer(2))

And then it gets weird. In order to produce an int for range to consume, Sage has to figure out how to turn log(8)/log(2) into 3, and it turns out that she does the worst thing possible. Plagiarizing my original diagnosis (mutatis mutandis):

First she checks to see if this object has its own way to get an int, and it doesn't. So she builds a RealInterval object out of log(8)/log(2), and it turns out that this is about the worst thing she could do! She checks to see whether the lower and upper parts of the interval agree [on the floor, I mean] (so that she knows for certain what the floor is). But in this case, because it really is an integer! this is always going to look like:

sage: y = log(8)/log(2)
sage: rif = RealIntervalField(53)(y)
sage: rif
3.000000000000000?
sage: rif.endpoints()
(2.99999999999999, 3.00000000000001)

These two bounds have floors which aren't aren't equal, so Sage decides she hasn't solved the problem yet, and she keeps increasing the precision to 20000 bits to see if she can prove that they are.. but by construction it's never going to work. Finally she gives up and tries to simplify it, which succeeds:

sage: y.simplify_full()
3

Proof without words that it's a perverse property of the exactly divisible case:

sage: %timeit range(log(8r, 2))
1 loops, best of 3: 2.18 s per loop
sage: %timeit range(log(9r, 2))
1000 loops, best of 3: 766 us per loop
sage: %timeit range(log(15r, 2))
1000 loops, best of 3: 764 us per loop
sage: %timeit range(log(16r, 2))
1 loops, best of 3: 2.19 s per loop

回答2:

This looks like it's a Sage bug.

I created a new notebook and did this:

n = len([1,2,3,4,5,6,7,8])
k = 8
timeit('range(log(n, 2))', number=2, repeat=3) # Test 1
timeit('range(log(len([1,2,3,4,5,6,7,8]), 2))', number=2, repeat=3) # Test 1.5
timeit('range(log(k, 2))', number=2, repeat=3) # Test 2

Test 1.5 is just as slow as test 1. But if you break it down in any way—take off the range, or even add m=n+0 and use m instead of n, it drops down to microseconds.

So clearly, Sage is trying to do something complicated here while evaluating the expression, and getting confused.

To verify this, in plain old ipython:

n = len([1,2,3,4,5,6,7,8])
k = 8
%timeit 'range(log(n, 2))'
%timeit 'range(log(len([1,2,3,4,5,6,7,8]), 2))'
%timeit 'range(log(k, 2))'

They're all equally fast, as you'd expect.

So… what do you do about it?

Well, you may want to try to track down the Sage bug and file it upstream. But meanwhile, you probably want a workaround in your code.

As noted above, just doing m = n+0 and using m instead of n seems to speed it up. See if that works for you?

回答3:

Python 2 allows range(some_float), but its deprecated and doesn't work in python 3.

The code sample doesn't give the output specified. But we can walk through it. First, timeit needs a full script, the import in the script calling timeit is not used:

>>> timeit('range(log(8,2))')
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/timeit.py", line 226, in timeit
    return Timer(stmt, setup, timer).timeit(number)
  File "/usr/lib64/python2.6/timeit.py", line 192, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
NameError: global name 'log' is not defined

If you add the import to the script being timed, it includes the setup time:

>>> timeit('from math import log;range(log(8,2))')
3.7010221481323242

If you move the import to the setup, its better, but timing a one-shot is notoriously inaccurate:

>>> timeit('range(log(8,2))',setup='from math import log')
1.9139349460601807

Finally, run it a bunch of times and you get a good number:

>>> timeit('range(log(8,2))',setup='from math import log',number=100)
0.00038290023803710938

回答4:

Maybe using log(x, 2) (aka ld()) isn't a good idea in the first place. I'd propose to use shifting the int values to implement the ld():

n = len(array)
while n:
  n >>= 1
  # perform the loop stuff

This way you might avoid all these uglinesses with the range() and the log().

In normal situations calling log() should take more time than simple bit shifting on an int. Examples:

>>> timeit('for i in range(int(math.log(8, 2))): pass', setup='import math')
0.6762251853942871
>>> timeit('n = 8\nwhile n:\n  n >>= 1')
0.24107813835144043

With larger values for n the difference gets smaller. For n = 10000 I got 0.8163230419158936 and 0.8106038570404053, but that should be because then the loop body will take the majority of the time, compared to the loop initialization.

来源：https://stackoverflow.com/questions/16289354/why-is-creating-a-range-from-0-to-loglenlist-2-so-slow

标签

python

performance

sage