Python performance characteristics

再見小時候 2020-12-08 03:23

I'm in the process of tuning a pet project of mine to improve its performance. I've already busted out the profiler to identify hotspots, but I'm thinking a better understanding of Python's performance characteristics would also be useful.

6 Answers
  •  囚心锁ツ
    2020-12-08 03:50

    Python's compiler is deliberately dirt-simple -- this makes it fast and highly predictable. Apart from some constant folding, it basically generates bytecode that faithfully mimics your sources. Somebody else already suggested dis, and it's indeed a good way to look at the bytecode you're getting -- for example, how for i in [1, 2, 3]: isn't actually doing constant folding but generating the literal list on the fly, while for i in (1, 2, 3): (looping on a literal tuple instead of a literal list) is able to constant-fold (reason: a list is a mutable object, and to keep to the "dirt-simple" mission statement the compiler doesn't bother to check that this specific list is never modified so it could be optimized into a tuple).
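    Since dis came up, here's a self-contained way to peek at the bytecode from inside Python rather than at a shell prompt (a sketch using the standard dis.Bytecode API, available since Python 3.4; exact opcode names vary across CPython versions, and newer releases optimize these cases more aggressively than the 2.x behavior described above):

```python
import dis

# Disassemble both loops to readable text.  The tuple literal is stored as
# a single constant (visible as LOAD_CONST), while both loops share the
# same generic iteration machinery (GET_ITER / FOR_ITER).
tuple_loop = dis.Bytecode("for i in (1, 2, 3): pass").dis()
list_loop = dis.Bytecode("for i in [1, 2, 3]: pass").dis()

print(tuple_loop)
print(list_loop)
```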

    So there's space for ample manual micro-optimization -- hoisting, in particular. I.e., rewrite

    for x in whatever():
        anobj.amethod(x)
    

    as

    f = anobj.amethod
    for x in whatever():
        f(x)
    

    to save the repeated lookups (the compiler doesn't check whether a run of anobj.amethod can actually change anobj's bindings &c so that a fresh lookup is needed next time -- it just does the dirt-simple thing, i.e., no hoisting, which guarantees correctness but definitely doesn't guarantee blazing speed;-).
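    The hoisting pattern is easy to measure from inside Python too. A sketch, with a made-up Worker class standing in for anobj (the exact ratio depends on your interpreter, but the hoisted loop does one attribute lookup total instead of one per iteration):

```python
import timeit

class Worker:
    # Stand-in for "anobj" above; the method body is deliberately empty
    # so that lookup and call overhead dominate the per-iteration cost.
    def amethod(self, x):
        pass

ns = {"anobj": Worker(), "data": list(range(1000))}

plain = timeit.timeit("for x in data: anobj.amethod(x)",
                      globals=ns, number=200)
hoisted = timeit.timeit("f = anobj.amethod\nfor x in data: f(x)",
                        globals=ns, number=200)
print(f"plain:   {plain:.4f}s")
print(f"hoisted: {hoisted:.4f}s")
```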

    The timeit module (best used at a shell prompt IMHO) makes it very simple to measure the overall effects of compilation + bytecode interpretation (just ensure the snippet you're measuring has no side effects that would affect the timing, since timeit does run it over and over in a loop;-). For example:

    $ python -mtimeit 'for x in (1, 2, 3): pass'
    1000000 loops, best of 3: 0.219 usec per loop
    $ python -mtimeit 'for x in [1, 2, 3]: pass'
    1000000 loops, best of 3: 0.512 usec per loop
    

    you can see the cost of the repeated list construction -- and confirm that this is indeed what we're observing by trying a minor tweak:

    $ python -mtimeit -s'Xs=[1,2,3]' 'for x in Xs: pass'
    1000000 loops, best of 3: 0.236 usec per loop
    $ python -mtimeit -s'Xs=(1,2,3)' 'for x in Xs: pass'
    1000000 loops, best of 3: 0.213 usec per loop
    

    moving the iterable's construction to the -s setup (which is run only once and not timed) shows that the looping proper is slightly faster on tuples (maybe 10%), but the big issue with the first pair (list slower than tuple by over 100%) is mostly with the construction.
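    The same measurements can be scripted with the timeit module directly; its setup argument plays the role of -s (run once, untimed). A sketch -- results on a modern CPython will likely differ from these 2.x-era shell transcripts:

```python
import timeit

N = 1_000_000

# Construction happens inside the timed statement...
build_list = timeit.timeit("for x in [1, 2, 3]: pass", number=N)
build_tuple = timeit.timeit("for x in (1, 2, 3): pass", number=N)
# ...versus construction moved to setup, so only the looping is timed.
loop_list = timeit.timeit("for x in Xs: pass", setup="Xs = [1, 2, 3]", number=N)
loop_tuple = timeit.timeit("for x in Xs: pass", setup="Xs = (1, 2, 3)", number=N)

for name, t in [("list literal", build_list), ("tuple literal", build_tuple),
                ("hoisted list", loop_list), ("hoisted tuple", loop_tuple)]:
    print(f"{name:14s} {t / N * 1e6:.4f} usec per loop")
```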

    Armed with timeit and the knowledge that the compiler's deliberately very simple minded in its optimizations, we can easily answer other questions of yours:

    How fast are the following operations (comparatively)

    * Function calls
    * Class instantiation
    * Arithmetic
    * 'Heavier' math operations such as sqrt()
    
    $ python -mtimeit -s'def f(): pass' 'f()'
    10000000 loops, best of 3: 0.192 usec per loop
    $ python -mtimeit -s'class o: pass' 'o()'
    1000000 loops, best of 3: 0.315 usec per loop
    $ python -mtimeit -s'class n(object): pass' 'n()'
    10000000 loops, best of 3: 0.18 usec per loop
    

    so we see: instantiating a new-style class and calling a function (both empty) are about the same speed, with instantiations possibly having a tiny speed margin, maybe 5%; instantiating an old-style class is slowest (by about 50%). Tiny differences such as 5% or less of course could be noise, so repeating each try a few times is advisable; but differences like 50% are definitely well beyond noise.
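    (A side note for Python 3 readers: old-style classes are gone there, so the middle measurement no longer applies -- every class behaves like the new-style case.) The first two comparisons can also be scripted by passing callables straight to timeit; a sketch:

```python
import timeit

def f():
    pass

class N:
    pass

n_calls = 1_000_000
# timeit accepts a callable as the statement; it calls it n_calls times.
call_cost = timeit.timeit(f, number=n_calls)   # empty function call
inst_cost = timeit.timeit(N, number=n_calls)   # empty class instantiation
print(f"function call: {call_cost / n_calls * 1e6:.4f} usec")
print(f"instantiation: {inst_cost / n_calls * 1e6:.4f} usec")
```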

    $ python -mtimeit -s'from math import sqrt' 'sqrt(1.2)'
    1000000 loops, best of 3: 0.22 usec per loop
    $ python -mtimeit '1.2**0.5'
    10000000 loops, best of 3: 0.0363 usec per loop
    $ python -mtimeit '1.2*0.5'
    10000000 loops, best of 3: 0.0407 usec per loop
    

    and here we see: calling sqrt is slower than doing the same computation by operator (using the ** raise-to-power operator) by roughly the cost of calling an empty function; all arithmetic operators are roughly the same speed to within noise (the tiny difference of 3 or 4 nanoseconds is definitely noise;-). Checking whether constant folding might interfere:

    $ python -mtimeit '1.2*0.5'
    10000000 loops, best of 3: 0.0407 usec per loop
    $ python -mtimeit -s'a=1.2; b=0.5' 'a*b'
    10000000 loops, best of 3: 0.0965 usec per loop
    $ python -mtimeit -s'a=1.2; b=0.5' 'a*0.5'
    10000000 loops, best of 3: 0.0957 usec per loop
    $ python -mtimeit -s'a=1.2; b=0.5' '1.2*b'
    10000000 loops, best of 3: 0.0932 usec per loop
    

    ...we see that this is indeed the case: if either or both numbers are being looked up as variables (which blocks constant folding), we're paying the "realistic" cost. Variable lookup has its own cost:

    $ python -mtimeit -s'a=1.2; b=0.5' 'a'
    10000000 loops, best of 3: 0.039 usec per loop
    

    and that's far from negligible when we're trying to measure such tiny times anyway. Indeed constant lookup isn't free either:

    $ python -mtimeit -s'a=1.2; b=0.5' '1.2'
    10000000 loops, best of 3: 0.0225 usec per loop
    

    as you see, while smaller than variable lookup it's quite comparable -- about half.
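    Both kinds of lookup show up as distinct opcodes in the bytecode, which is another place dis helps; a sketch (opcode names can vary a little between CPython versions):

```python
import dis

# In "a + 1.2", the name is fetched with LOAD_NAME (at module scope) and
# the literal with LOAD_CONST -- both real bytecode steps with real,
# if tiny, runtime cost.
listing = dis.Bytecode("a + 1.2").dis()
print(listing)
```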

    If and when (armed with careful profiling and measurement) you decide some nucleus of your computations desperately needs optimization, I recommend trying cython. It's a C / Python merge that tries to be as neat as Python and as fast as C, and while it can't get there 100%, it surely makes a good fist of it; in particular, it produces binary code that's quite a bit faster than you can get with its predecessor language, pyrex, as well as being a bit richer. For the last few percent of performance you may still want to go down to C (or assembly / machine code in some exceptional cases), but that would be really, really rare.
