Python pypy: Efficient sum of absolute array/vector difference

Question

I am trying to reduce the computation time of my script, which is run with pypy. It has to calculate the pairwise sums of absolute differences for a large number of lists/vectors/arrays. The input vectors are quite short, between 10 and 500 elements. I have tested three different approaches so far:

1) Naive approach, input as lists:

import math
from itertools import izip  # Python 2; on Python 3 use the built-in zip

def std_sum(v1, v2):
    distance = 0.0
    for a, b in izip(v1, v2):
        distance += math.fabs(a - b)
    return distance

2) With lambdas and reduce, input as lists:

lzi = lambda v1, v2: reduce(lambda s, (a,b):s + math.fabs(a-b), izip(v1, v2), 0)
def lmd_sum(v1, v2):
    return lzi(v1, v2)

3) Using numpy, input as numpy.arrays:

import numpy as np

def np_sum(v1, v2):
    return np.sum(np.abs(v1 - v2))

On my machine, using pypy and pairs from itertools.combinations_with_replacement of 500 such lists, the first two approaches are very similar (roughly 5 seconds), while the numpy approach is significantly slower, taking around 12 seconds.
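The setup described above can be sketched roughly as follows. This is not the linked benchmark script: the helper name `time_pairwise`, the random test data, and the sizes (50 vectors of length 100) are assumptions chosen only for illustration, and `zip` is used instead of `izip` so the sketch runs on both Python 2 and Python 3.

```python
import math
import random
import time
from itertools import combinations_with_replacement


def std_sum(v1, v2):
    # Approach 1: a single loop over both lists.
    distance = 0.0
    for a, b in zip(v1, v2):
        distance += math.fabs(a - b)
    return distance


def time_pairwise(func, vectors):
    # Hypothetical helper: time func over all unordered pairs,
    # including each vector paired with itself.
    start = time.time()
    total = 0.0
    for v1, v2 in combinations_with_replacement(vectors, 2):
        total += func(v1, v2)
    return time.time() - start, total


random.seed(0)
# Assumed sizes; the question uses 500 lists of length 10 to 500.
vectors = [[random.random() for _ in range(100)] for _ in range(50)]
elapsed, checksum = time_pairwise(std_sum, vectors)
print("std_sum: %.3f s (checksum %.3f)" % (elapsed, checksum))
```

Returning the accumulated total alongside the elapsed time makes it easy to check that different implementations compute the same result before comparing their timings.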

Is there a faster way to do the calculations? The lists are read and parsed from text files and an increased preprocessing time would be no problem (such as creating numpy arrays). The lists contain floating point numbers and are of equal size which is known beforehand.

The script I use for "benchmarking" can be found here and some example data here.

Answer 1:

"Is there a faster way to do the calculations? The lists are read and parsed from text files and an increased preprocessing time would be no problem (such as creating numpy arrays). The lists contain floating point numbers and are of equal size which is known beforehand."

PyPy is very good at optimizing list accesses, so you should probably stick to using lists.

One thing that will help PyPy optimize things is to make sure each list only ever holds objects of a single type. For example, if you read strings from a file, don't put them in a list and then parse them into floats in place. Instead, build the list from floats to begin with, for example by parsing each string as soon as it is read. Likewise, never try to preallocate a list, especially with [None,]*N, or PyPy will not be able to guess that all the elements have the same type.
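A minimal sketch of that advice, assuming one vector per line with whitespace-separated numbers (the actual file format is not given here, and the function names `parse_vector` and `read_vectors` are made up for illustration):

```python
def parse_vector(line):
    # Convert each token to float as soon as it is read, so the
    # resulting list only ever contains floats.
    return [float(token) for token in line.split()]


def read_vectors(lines):
    # `lines` can be an open file or any iterable of strings.
    # Blank lines are skipped.
    return [parse_vector(line) for line in lines if line.strip()]


# Example with in-memory lines instead of a real file.
vectors = read_vectors(["1.0 2.5 -3.0\n", "\n", "4 5 6\n"])
```

Each list is created directly from floats, with no intermediate list of strings and no placeholder values that are overwritten later.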

Second, iterate over the lists as few times as possible. Your np_sum function walks the arrays three times (subtract, abs, sum) unless PyPy notices and can optimize it. Approaches 1 and 2 walk the lists only once, so they are faster.
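To make the difference in the number of passes concrete, here is a sketch that writes both patterns with plain lists. The function names are hypothetical, and `three_pass_sum` only mimics the subtract/abs/sum structure of np_sum; it is not how numpy is implemented. No claim is made about the actual timings, which should be measured under PyPy.

```python
def single_pass_sum(v1, v2):
    # One walk over the data: subtract, take the absolute value and
    # accumulate for each element before moving on to the next.
    total = 0.0
    for i in range(len(v1)):
        d = v1[i] - v2[i]
        if d < 0.0:
            d = -d
        total += d
    return total


def three_pass_sum(v1, v2):
    # Three separate walks, each over a full-length sequence,
    # with two temporary lists in between.
    diffs = [a - b for a, b in zip(v1, v2)]   # pass 1: subtract
    absolutes = [abs(d) for d in diffs]        # pass 2: abs
    return sum(absolutes)                      # pass 3: sum


result = single_pass_sum([1.0, 2.0, 3.0], [3.0, 2.0, 0.5])
```

Both functions return the same value for equal-length inputs; the single-pass version avoids building the intermediate lists.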

Source: https://stackoverflow.com/questions/23983148/python-pypy-efficient-sum-of-absolute-array-vector-difference

Tags

python

performance

numpy

vector

pypy