Python: Counting cumulative occurrences of values in a pandas series

心在旅途 2021-01-13 04:16

I have a DataFrame that looks like this:

    fruit
0  orange
1  orange
2  orange
3    pear
4  orange
5   apple
6   apple
7    pear
8    pear
9  orange
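
I would like to add a column counting, for each row, the cumulative occurrences of that row's fruit up to and including that row. For reproducibility, the frame above can be rebuilt like this (a minimal sketch; the original construction is not shown, and the column is assumed to be a plain string column):

    import pandas as pd

    # Rebuild the example frame shown above.
    df = pd.DataFrame(
        {'fruit': ['orange', 'orange', 'orange', 'pear', 'orange',
                   'apple', 'apple', 'pear', 'pear', 'orange']}
    )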


        
2 Answers
  • 2021-01-13 04:41

    You could use groupby and cumcount (cumcount is zero-based, hence the + 1):

    df['cum_count'] = df.groupby('fruit').cumcount() + 1
    
    In [16]: df
    Out[16]:
        fruit  cum_count
    0  orange          1
    1  orange          2
    2  orange          3
    3    pear          1
    4  orange          4
    5   apple          1
    6   apple          2
    7    pear          2
    8    pear          3
    9  orange          5
    

    Timing

    In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
    100 loops, best of 3: 3.76 ms per loop
    
    In [9]: %timeit df.groupby('fruit').cumcount() + 1
    1000 loops, best of 3: 926 µs per loop
    

    So it is roughly four times faster.
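
    As a sanity check, the vectorized result can be compared with the per-row loop timed above (a small sketch; items() is used here in place of the older iteritems()):

    import pandas as pd

    # Per-row loop: for each position, count occurrences of that fruit so far.
    loop_counts = pd.Series(
        [(df.fruit[0:i + 1] == x).sum() for i, x in df.fruit.items()],
        index=df.index,
    )

    # Vectorized version from above.
    grouped_counts = df.groupby('fruit').cumcount() + 1

    # Both approaches should agree on every row.
    assert (loop_counts == grouped_counts).all()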

  • 2021-01-13 04:46

    It may be better to use groupby with cumcount on the specified column, because it is more efficient:

    df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1
    print(df)
    
        fruit  cum_count
    0  orange          1
    1  orange          2
    2  orange          3
    3    pear          1
    4  orange          4
    5   apple          1
    6   apple          2
    7    pear          2
    8    pear          3
    9  orange          5
    

    With len(df) = 10, this solution is the fastest:

    In [3]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
    The slowest run took 11.67 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 299 µs per loop
    
    In [4]: %timeit df.groupby('fruit').cumcount() + 1
    The slowest run took 12.78 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 921 µs per loop
    
    In [5]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
    The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached 
    100 loops, best of 3: 2.72 ms per loop
    

    With len(df) = 10k:

    In [7]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
    The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 845 µs per loop
    
    In [8]: %timeit df.groupby('fruit').cumcount() + 1
    The slowest run took 5.59 times longer than the fastest. This could mean that an intermediate result is being cached 
    100 loops, best of 3: 1.59 ms per loop
    
    In [9]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
    1 loops, best of 3: 5.12 s per loop
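
    The construction of the 10k-row frame is not shown; one way to reproduce a benchmark of roughly this size (an assumption, not the original setup) is:

    import numpy as np
    import pandas as pd

    # Hypothetical 10k-row frame: sample the three fruits at random.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'fruit': rng.choice(['orange', 'pear', 'apple'], size=10_000)})

    # Cumulative count per fruit, as in the answer above.
    df['cum_count'] = df.groupby('fruit')['fruit'].cumcount() + 1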
    