Python jagged array operation efficiency

Question

I am new to Python and I am looking for the most efficient way to do operations with a jagged array.

I have a jagged array like this:

A = array([[array([1, 2, 3]), array([4, 5])],[array([6, 7, 8, 9]), array([10])]], dtype=object)

I want to be able to do things like this:

A=A[A>4]
B=A+A

Apparently Python is very efficient at operations like this on NumPy arrays, but unfortunately I need to do this for jagged arrays, and I haven't found such an object in Python. Does it exist in Python, or is there a library that allows efficient operations on jagged arrays?

For the example I gave, here are the outputs I'd like:

A = array([[array([]), array([5])],[array([6, 7, 8, 9]), array([10])]], dtype=object)
B = array([[array([]), array([10])],[array([12, 14, 16, 18]), array([20])]], dtype=object)

But maybe, given the way Python works, it simply cannot do efficient operations on jagged arrays the way it does with NumPy arrays; I don't know the details.

Answer 1:

Your array is 2x2:

In [298]: A
Out[298]: 
array([[array([1, 2, 3]), array([4, 5])],
       [array([6, 7, 8, 9]), array([10])]], dtype=object)

While A+A works, boolean tests have not been implemented for this kind of array:

In [299]: A>4
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I'm going to flatten A because it makes it easier to compare with list operations:

In [301]: A1=A.flatten()

In [303]: A1+A1
Out[303]: 
array([array([2, 4, 6]), array([ 8, 10]), array([12, 14, 16, 18]),
       array([20])], dtype=object)

In [304]: [a+a for a in A1]
Out[304]: [array([2, 4, 6]), array([ 8, 10]), array([12, 14, 16, 18]), array([20])]

In [305]: timeit A1+A1
100000 loops, best of 3: 6.85 µs per loop

In [306]: timeit [a+a for a in A1]
100000 loops, best of 3: 9.09 µs per loop

The array operation is a bit faster than a list comprehension. But if I first turn the array into a list:

In [307]: A1l=A1.tolist()

In [308]: A1l
Out[308]: [array([1, 2, 3]), array([4, 5]), array([6, 7, 8, 9]), array([10])]

In [309]: timeit [a+a for a in A1l]
100000 loops, best of 3: 5.2 µs per loop

the time improves. This is a good indication that A1+A1 (or even A+A) uses a similar sort of iteration internally.

So the straightforward way of performing your A and B calculations is:

In [310]: A2=[a[a>4] for a in A1]
In [311]: B=[a+a for a in A2]
In [312]: B
Out[312]: [array([], dtype=int32), array([10]), array([12, 14, 16, 18]), array([20])]

(we can convert to/from arrays and lists as needed).
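The conversion back to a 2x2 object array mentioned above could look like the sketch below. `map_object_array` is a hypothetical helper name chosen for illustration, not a NumPy function; it simply loops over the outer array in Python, so it is no faster than the list comprehension.

```python
import numpy as np

def map_object_array(func, arr):
    """Apply func to every sub-array of an object array, keeping the outer shape.

    Hypothetical helper for illustration; it iterates in Python.
    """
    out = np.empty(arr.shape, dtype=object)
    for idx in np.ndindex(arr.shape):
        out[idx] = func(arr[idx])
    return out

# Rebuild the 2x2 jagged array from the question, one cell at a time.
A = np.empty((2, 2), dtype=object)
A[0, 0] = np.array([1, 2, 3])
A[0, 1] = np.array([4, 5])
A[1, 0] = np.array([6, 7, 8, 9])
A[1, 1] = np.array([10])

# Filter each sub-array, then add the filtered array to itself.
A_filtered = map_object_array(lambda a: a[a > 4], A)
B = A_filtered + A_filtered

print(B.shape)
print(B[1, 0])
```

Filling a preallocated `np.empty(..., dtype=object)` cell by cell avoids relying on how `np.array` guesses the shape of ragged input.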

A NumPy array stores its data in a flat data buffer and uses the shape and strides attributes to quickly calculate the location of any element, regardless of the number of dimensions. The fast array operations use compiled code that rapidly steps through the data buffers of the arguments, performing the operations element by element (or in some other combination).
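As a small illustration of the shape-and-strides lookup, the sketch below computes the byte offset of one element by hand and reads it straight from a copy of the buffer. The array `x` and the indices are made up for the example.

```python
import numpy as np

# A regular (non-jagged) 3x4 array of 8-byte integers.
x = np.arange(12, dtype=np.int64).reshape(3, 4)

# strides: bytes to step to reach the next row / the next column.
row_stride, col_stride = x.strides

# Byte offset of element (i, j) inside the flat buffer.
i, j = 2, 1
offset = i * row_stride + j * col_stride

# Read that one element directly from a copy of the raw bytes.
value = np.frombuffer(x.tobytes(), dtype=x.dtype, count=1, offset=offset)[0]

print(x.strides, offset, value, x[i, j])

# Transposing only swaps the strides; no data is copied.
print(x.T.strides, np.shares_memory(x, x.T))
```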

An object-dtype array also has a flat data buffer, but its elements are pointers to lists or arrays stored elsewhere. So while it can index individual elements quickly, it still has to make one or more Python calls to operate on each sub-array. Especially when the array is 1-D, it is virtually the same as a flat list holding the same pointers.
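The "same pointers" point can be checked directly. In this sketch the list produced by `tolist()` refers to the very same sub-array objects as the object array, so a change made through one is visible through the other.

```python
import numpy as np

subs = [np.array([1, 2, 3]), np.array([4, 5]),
        np.array([6, 7, 8, 9]), np.array([10])]

# 1-D object array: each slot holds a reference to one sub-array.
A1 = np.empty(len(subs), dtype=object)
for k, sub in enumerate(subs):
    A1[k] = sub

# The list holds references to the same objects, not copies.
A1l = A1.tolist()
same_object = A1l[0] is A1[0]

# Modifying the sub-array through the list shows up through the array.
A1l[0][0] = 99
seen_through_array = A1[0][0]

print(same_object, seen_through_array)
```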

Multidimensional object arrays are nicer than nested lists. You can reshape them, access elements (A[1,3] vs. Al[1][3]), transpose them, etc. But when it comes to iterating through all the subarrays, they don't offer much of a benefit.
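A brief sketch of those conveniences, using the 2x2 array from the question (so the indices here stay within a 2x2 shape):

```python
import numpy as np

A = np.empty((2, 2), dtype=object)
A[0, 0] = np.array([1, 2, 3])
A[0, 1] = np.array([4, 5])
A[1, 0] = np.array([6, 7, 8, 9])
A[1, 1] = np.array([10])

Al = A.tolist()  # equivalent nested list of sub-arrays

# Element access: one tuple index vs. chained list indexing.
from_array = A[1, 0]
from_list = Al[1][0]

# Transpose and column selection are built in for the array...
At = A.T
column = A[:, 0]
# ...while the nested list needs an explicit comprehension.
column_from_list = [row[0] for row in Al]

# Reshape the outer structure without touching the sub-arrays.
flat = A.reshape(4)

print(from_array, At[0, 1], column[1], flat.shape)
```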

Looking again at your 2d array:

In [315]: timeit A+A
100000 loops, best of 3: 6.93 µs per loop  # 6.85 for A1+A1 (above)

In [316]: timeit [[j+j for j in i] for i in A]
100000 loops, best of 3: 17.1 µs per loop

In [317]: Al = A.tolist()

In [318]: timeit [[j+j for j in i] for i in Al]
100000 loops, best of 3: 7.01 µs per loop    # 5.2 for A1l flat list

Adding the array to itself takes basically the same time as iterating through the equivalent nested list.

Answer 2:
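Since the fast path depends on a flat data buffer, one workaround (not something this answer proposes, just a sketch under that assumption) is to concatenate the sub-arrays into a single regular array, run the vectorized operations there, and split the result back afterwards. All variable names below are illustrative.

```python
import numpy as np

subs = [np.array([1, 2, 3]), np.array([4, 5]),
        np.array([6, 7, 8, 9]), np.array([10])]

# One flat buffer plus a record of which sub-array each value came from.
flat = np.concatenate(subs)
lengths = np.array([len(s) for s in subs])
segment_ids = np.repeat(np.arange(len(subs)), lengths)

# Vectorized filter over the whole buffer at once.
mask = flat > 4
kept = flat[mask]
# How many values survive in each sub-array (zero is allowed).
kept_lengths = np.bincount(segment_ids[mask], minlength=len(subs))

# Vectorized arithmetic over the whole buffer at once.
doubled = kept + kept

# Split back into jagged pieces only when they are needed.
split_points = np.cumsum(kept_lengths)[:-1]
A_parts = np.split(kept, split_points)
B_parts = np.split(doubled, split_points)

print([p.tolist() for p in A_parts])
print([p.tolist() for p in B_parts])
```

Whether this pays off depends on how many sub-arrays there are and how often the jagged view has to be rebuilt; for four small sub-arrays the bookkeeping likely costs more than it saves.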

The performance of a NumPy jagged array may not be optimal, but there is good reason to believe it should be much better than a nested Python list. As explained in your earlier post:

In principle you should get some performance bonus because every element is a NumPy array. So you just need a 2-dimensional loop rather than a 3-D loop (as you would if you stored every number in nested lists). Avoiding Python lists also saves a lot of memory-allocation time.

Here is a simple test:

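import sys
import time
import random
import numpy as np

# Usage: python testNpJarray.py np    (any other argument runs the list version)
rand = np.random.rand
# Jagged 2x2 object array holding 1000 floats in four sub-arrays
L = np.array([[rand(100), rand(200)], [rand(400), rand(300)]], dtype=object)
# Flat Python list of 1000 floats for comparison
L1 = [random.random() for i in range(1000)]
# Apply boolean indexing to each sub-array; otypes keeps the result an object array
arrFunc = np.vectorize(lambda x: x[x > 0.3], otypes=[np.ndarray])

# Arithmetic: multiply every number by a scalar
start = time.time()
if sys.argv[1] == 'np':
    for i in range(100000):
        B = i * L
else:
    for i in range(100000):
        B = [i * x for x in L1]

end = time.time()
print('Arithmetic Op: ', end - start)


# Indexing: filter the numbers by a threshold
start = time.time()
if sys.argv[1] == 'np':
    for i in range(100000):
        B = arrFunc(L)
else:
    for i in range(100000):
        # Note: this branch keeps x < 0.3, while the NumPy branch keeps x > 0.3
        B = [x for x in L1 if x < 0.3]
end = time.time()
print('Indexing       ', end - start)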

Result:

> python testNpJarray.py np
Arithmetic Op:  3.9719998836517334
Indexing        8.079999923706055

> python testNpJarray.py list
Arithmetic Op:  53.289000034332275
Indexing        52.10899996757507

This test may not be quite fair because the outer NumPy array is quite small; you are welcome to change the size to fit your application and tell us the results.
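For anyone who wants to try other sizes, here is a minimal sketch of a parametrized variant. It is not the script above: it compares an object array against a plain list holding the same sub-arrays, uses `timeit`, and the function names, the default sizes, and the fixed seed are all assumptions made for the example. No timings are quoted because they depend on the machine and the sizes chosen.

```python
import timeit
import numpy as np

def make_jagged(n_outer, max_len=50, seed=0):
    """Build a 1-D object array of n_outer random-length float sub-arrays."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_outer, dtype=object)
    for k in range(n_outer):
        out[k] = rng.random(rng.integers(1, max_len + 1))
    return out

def filter_object_array(arr, threshold=0.3):
    """Keep values above threshold, returning an object array."""
    out = np.empty(arr.shape, dtype=object)
    for k, sub in enumerate(arr):
        out[k] = sub[sub > threshold]
    return out

def filter_list(subs, threshold=0.3):
    """Same filter, applied to a plain list of sub-arrays."""
    return [sub[sub > threshold] for sub in subs]

jagged = make_jagged(1000)   # change the outer size here
as_list = jagged.tolist()

t_arr = timeit.timeit(lambda: filter_object_array(jagged), number=20)
t_list = timeit.timeit(lambda: filter_list(as_list), number=20)
print('object array:', t_arr)
print('list        :', t_list)
```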

Source: https://stackoverflow.com/questions/37212981/python-jagged-array-operation-efficiency

Tags: python, arrays, numpy, jagged-arrays