Fastest way to grow a numpy numeric array

后悔当初 2020-11-27 12:01

Requirements:

  • I need to grow an array arbitrarily large from data.
  • I can guess the size (roughly 100-200) with no guarantees that the array will fit
5 answers
  • 2020-11-27 12:27

    There is a big performance difference depending on which function you use for finalization, that is, for turning the accumulated list into a single array. Consider the following code:

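    import numpy as np
    from time import time
    
    N = 100000
    nruns = 5
    
    # Build a list of N row arrays (about 800 MB of float64 in total).
    a = []
    for i in range(N):
        a.append(np.zeros(1000))
    
    print("start")
    
    # Version 1: np.vstack
    b = []
    for i in range(nruns):
        s = time()
        c = np.vstack(a)
        b.append(time() - s)
    print("Timing version vstack ", np.mean(b))
    
    # Version 2: np.reshape applied directly to the list
    b = []
    for i in range(nruns):
        s = time()
        c1 = np.reshape(a, (N, 1000))
        b.append(time() - s)
    
    print("Timing version reshape ", np.mean(b))
    
    # Version 3: np.concatenate followed by a reshape
    b = []
    for i in range(nruns):
        s = time()
        c2 = np.concatenate(a, axis=0).reshape(-1, 1000)
        b.append(time() - s)
    
    print("Timing version concatenate ", np.mean(b))
    
    # All three versions must produce the same array.
    print(c.shape, c2.shape)
    assert (c == c2).all()
    assert (c == c1).all()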
    

    In this run, concatenate is roughly 2.4 times as fast as the first version (vstack) and more than 14 times as fast as the second version (reshape). Timings in seconds:

    Timing version vstack  1.5774928093
    Timing version reshape  9.67419199944
    Timing version concatenate  0.669512557983
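
    The pattern timed above (append rows to a list, then build the array once with concatenate) can be wrapped in a small helper. This is only a sketch: the name `RowCollector` and the row width of 5 are assumptions for illustration, not part of the benchmark.

```python
import numpy as np


class RowCollector:
    """Hypothetical helper: collect fixed-width rows, build the array once."""

    def __init__(self, width):
        self.width = width
        self.rows = []          # list of 1-D arrays, one per row

    def update(self, row):
        row = np.asarray(row, dtype=float)
        if row.shape != (self.width,):
            raise ValueError("expected a row of length %d" % self.width)
        self.rows.append(row)   # list append; the existing rows are not copied

    def finalize(self):
        if not self.rows:
            return np.empty((0, self.width))
        # One copy of all the data into a flat array, then a reshape to rows.
        return np.concatenate(self.rows, axis=0).reshape(-1, self.width)


collector = RowCollector(width=5)
for i in range(1000):
    collector.update(np.full(5, i))
result = collector.finalize()
print(result.shape)  # (1000, 5)
```

    Each call to finalize copies all rows again, so it is best called once, after the last update.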
    
  • 2020-11-27 12:29

    np.append() copies all the data in the array on every call, whereas a Python list over-allocates and grows its capacity by a factor of about 1.125, so appending to it is amortized O(1). A list is fast, but it uses more memory than an array. If memory matters, you can use the array module from the Python standard library.

    Here is a discussion about this topic:

    How to create a dynamic array
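
    A minimal sketch of the array-module approach mentioned above, assuming float64 data and rows of width 5 (both are assumptions for illustration):

```python
from array import array

import numpy as np

buf = array("d")            # typecode "d": C double, stored unboxed
for i in range(1000):
    buf.append(float(i))    # grows like a list, without per-item objects

# np.frombuffer creates a view on the array's memory without copying.
# The .copy() detaches the result so that `buf` can keep growing afterwards;
# without it, resizing `buf` while the view is alive raises BufferError.
data = np.frombuffer(buf, dtype=np.float64).copy().reshape(-1, 5)
print(data.shape)  # (200, 5)
```

    If the buffer is not going to grow again, the copy can be skipped and the view used directly.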

  • 2020-11-27 12:44

    Using the class declarations from Owen's post, here is a revised timing that also accounts for the cost of finalize.

    In short, I find class C to provide an implementation that is over 60x faster than the method in the original post. (apologies for the wall of text)

    The file I used:

    #!/usr/bin/env python3
    import cProfile
    import numpy as np
    
    # ... class declarations here ...
    
    def test_class(f):
        x = f()
        # 100000 single-element updates, then 1000 finalize calls
        for i in range(100000):
            x.update([i])
        for i in range(1000):
            x.finalize()
    
    # Profile each class in turn.
    for x in 'ABC':
        cProfile.run('test_class(%s)' % x)
    

    Now, the resulting timings:

    A:

         903005 function calls in 16.049 seconds
    
    Ordered by: standard name
    
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000   16.049   16.049 <string>:1(<module>)
    100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
    100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
    100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
    100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
      1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
         1    0.146    0.146   16.049   16.049 test.py:50(test_class)
         1    0.000    0.000    0.000    0.000 test.py:6(__init__)
    100000    1.475    0.000   15.899    0.000 test.py:9(update)
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
    200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
    100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}
    

    B:

         208004 function calls in 16.885 seconds
    
    Ordered by: standard name
    
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.001    0.001   16.885   16.885 <string>:1(<module>)
      1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
      1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
      1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
         1    0.000    0.000    0.000    0.000 test.py:16(__init__)
    100000    0.068    0.000    0.080    0.000 test.py:19(update)
      1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
         1    0.284    0.284   16.883   16.883 test.py:50(test_class)
      1000    0.005    0.000    0.005    0.000 {getattr}
      1000    0.001    0.000    0.001    0.000 {len}
    100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
      1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}
    

    C:

         204010 function calls in 0.244 seconds
    
    Ordered by: standard name
    
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.244    0.244 <string>:1(<module>)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
         1    0.000    0.000    0.000    0.000 test.py:27(__init__)
    100000    0.082    0.000    0.170    0.000 test.py:32(update)
    100000    0.087    0.000    0.088    0.000 test.py:36(add)
      1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
         1    0.068    0.068    0.243    0.243 test.py:50(test_class)
      1000    0.000    0.000    0.000    0.000 {len}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
         6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}
    

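    Class A is dominated by the cost of the updates (the concatenate inside np.append accounts for 11.9 of its 16.0 seconds), and class B by the cost of the finalize calls (converting the list to an array accounts for 16.4 of its 16.9 seconds). Class C is robust in the face of both.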
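
    Why finalize is expensive for a list but cheap for an existing array can be checked in isolation. The following is a minimal sketch, not the profiling setup used above; the variable names and the repeat count of 20 are assumptions chosen for illustration.

```python
import timeit

import numpy as np

values = [float(i) for i in range(100000)]   # what a list-based class stores
array_values = np.array(values)              # what an array-based class stores

# Converting a list walks every Python float object on each call.
from_list = np.reshape(values, (-1, 5))
# Reshaping an existing contiguous array only creates a new view.
from_array = np.reshape(array_values, (-1, 5))

print(np.shares_memory(from_array, array_values))  # True: no data was copied
print((from_list == from_array).all())             # True: same contents

t_list = timeit.timeit(lambda: np.reshape(values, (-1, 5)), number=20)
t_array = timeit.timeit(lambda: np.reshape(array_values, (-1, 5)), number=20)
print("list finalize:  %.6f s for 20 calls" % t_list)
print("array finalize: %.6f s for 20 calls" % t_array)
```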

  • 2020-11-27 12:48

    I tried a few different things, with timing.

    import numpy as np
    
    1. The method you mention as slow: (32.094 seconds)

      class A:
      
          def __init__(self):
              self.data = np.array([])
      
          def update(self, row):
              # np.append copies the whole array on every call
              self.data = np.append(self.data, row)
      
          def finalize(self):
              # // keeps the row count an integer under Python 3
              return np.reshape(self.data, (self.data.shape[0] // 5, 5))
      
    2. Regular ol Python list: (0.308 seconds)

      class B:
      
          def __init__(self):
              self.data = []
      
          def update(self, row):
              for r in row:
                  self.data.append(r)
      
          def finalize(self):
              # converts the whole list to an array on every call
              return np.reshape(self.data, (len(self.data) // 5, 5))
      
    3. Trying to implement an arraylist in numpy: (0.362 seconds)

      class C:
      
          def __init__(self):
              self.data = np.zeros((100,))
              self.capacity = 100
              self.size = 0
      
          def update(self, row):
              for r in row:
                  self.add(r)
      
          def add(self, x):
              if self.size == self.capacity:
                  # buffer is full: grow it by a factor of 4 and copy once
                  self.capacity *= 4
                  newdata = np.zeros((self.capacity,))
                  newdata[:self.size] = self.data
                  self.data = newdata
      
              self.data[self.size] = x
              self.size += 1
      
          def finalize(self):
              # slice off the unused tail; the slice is a view, not a copy
              data = self.data[:self.size]
              return np.reshape(data, (len(data) // 5, 5))
      

    And this is how I timed it:

    x = C()
    for i in range(100000):
        x.update([i])
    

    So it looks like regular old Python lists are pretty good ;)
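
    If the data arrives as whole rows, the arraylist idea from class C can also be applied in two dimensions, which removes the reshape in finalize. The sketch below is an assumption-laden variant, not one of the timed classes: the name `GrowableRows`, the growth factor of 2 and the initial capacity of 100 are all chosen for illustration.

```python
import numpy as np


class GrowableRows:
    """Hypothetical 2-D variant of class C: stores whole rows, doubles capacity."""

    def __init__(self, width, capacity=100):
        self.width = width
        self.capacity = capacity            # must be at least 1
        self.size = 0                       # number of rows in use
        self.data = np.empty((capacity, width))

    def update(self, row):
        if self.size == self.capacity:
            # Buffer is full: double it and copy the existing rows once.
            self.capacity *= 2
            newdata = np.empty((self.capacity, self.width))
            newdata[:self.size] = self.data
            self.data = newdata
        self.data[self.size] = row          # copies the row into the buffer
        self.size += 1

    def finalize(self):
        return self.data[:self.size]        # a view; no copy, no reshape needed


rows = GrowableRows(width=5)
for i in range(1000):
    rows.update([i, i + 1, i + 2, i + 3, i + 4])
result = rows.finalize()
print(result.shape)  # (1000, 5)
```

    Because finalize returns a view, later updates that trigger a reallocation leave earlier results pointing at the old buffer; copy the result if it must stay independent.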

  • 2020-11-27 12:51

    If you want to improve the performance of list operations, have a look at the blist library. It is an optimized implementation of the Python list and other structures.

    I haven't benchmarked it yet, but the results on their page look promising.
