Numpy grouping using itertools.groupby performance

前端未结

关注

 10  925

I have many large (>35,000,000) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. C

相关标签:

10条回答

有刺的猬

2020-12-01 03:24

More than 5 years have passed since Paul's answer was accepted. Interestingly, the sort() is still the bottleneck in the accepted solution.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def group_paul():
     5         1        99040  99040.0      2.4      import numpy as np
     6         1       305651 305651.0      7.4      values = np.array(np.random.randint(0, 2**32,size=35000000),dtype='u4')
     7         1      2928204 2928204.0    71.3      values.sort()
     8         1        78268  78268.0      1.9      diff = np.concatenate(([1],np.diff(values)))
     9         1       215774 215774.0      5.3      idx = np.concatenate((np.where(diff)[0],[len(values)]))
    10         1           95     95.0      0.0      index = np.empty(len(idx)-1,dtype='u4,u2')
    11         1       386673 386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492  91492.0      2.2      index['f1'] = np.diff(idx)

The accepted solution runs for 4.0 s on my machine, with radix sort it drops down to 1.7 s.

Just by switching to radix sort, I get an overall 2.35x speedup. The radix sort is more than 4x faster than quicksort in this case.

See How to sort an array of integers faster than quicksort? that was motivated by your question.

For the profiling I used line_profiler and kernprof (the @profile comes from there).

0 讨论(0)

無奈伤痛

2020-12-01 03:24

Replacing len(list(g)) with sum(1 for i in g) gives a 2x speedup

0 讨论(0)
发布评论:

提交评论
- 加载中...

一向

2020-12-01 03:28

This is a numpy solution:

def group():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')

    # we sort in place
    values.sort()

    # when sorted the number of occurences for a unique element is the index of 
    # the first occurence when searching from the right - the index of the first
    # occurence when searching from the left.
    #
    # np.dstack() is the numpy equivalent to Python's zip()

    l = np.dstack((values, values.searchsorted(values, side='right') - \
                   values.searchsorted(values, side='left')))

    index = np.fromiter(l, dtype='u4,u2')

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)

Runs in about 25 seconds on my machine compared to about 96 for your initial solution (which is a nice improvement).

There might be still room for improvement, I don't use numpy that often.

Edit: added some comments in code.

0 讨论(0)

孤独总比滥情好

2020-12-01 03:30
This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:
```
def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    edges = (values[1:] != values[:-1]).nonzero()[0] - 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype= 'u4, u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
```
This tested as about a half-second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening here; the two step diff approach is a bit opaque at first glance.
0 讨论(0)
发布评论:

提交评论
- 加载中...

温柔的废话

2020-12-01 03:33

i get a 3x improvement doing something like this:

def group():
    import numpy as np
    values = np.array(np.random.randint(0,3298,size=35000000),dtype='u4')
    values.sort()
    dif = np.ones(values.shape,values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif>0)
    vals = values[idx]
    count = np.diff(idx)

0 讨论(0)

孤独总比滥情好

2020-12-01 03:35
I guess the most obvious and still not mentioned approach is, to simply use collections.Counter. Instead of building a huge amount of temporarily used lists with groupby, it just upcounts integers. It's a oneliner and a 2-fold speedup, but still slower than the pure numpy solutions.
```
def group():
    import sys
    import numpy as np
    from collections import Counter
    values = np.array(np.random.randint(0,sys.maxint,size=35000000),dtype='u4')
    c = Counter(values)

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
```
I get a speedup from 136 s to 62 s for my machine, compared to the initial solution.
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页