Faster alternatives to Pandas pivot_table


隐瞒了意图╮ 2021-02-11 09:33

I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). Since execution time is paramount, I am trying to speed up the process. Currently it

3 answers

小鲜肉 (original poster)

2021-02-11 10:02

When you read the CSV file into a DataFrame, you can pass converter functions (via the converters parameter of read_csv) to transform client_name into a hash and to downcast orders to an appropriate integer type, ideally an unsigned one.
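A minimal sketch of that idea follows. The sample data, the hash_name helper, and the choice of zlib.crc32 as the hash are assumptions made for illustration, not part of the original answer. The sketch also uses the dtype parameter for orders rather than a converter, since that lets the parser build the narrow column directly.

```python
import io
import zlib

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_text = "client_name,orders\nalice,3\nbob,250\nalice,7\n"


def hash_name(name: str) -> int:
    # crc32 is stable across runs (the built-in hash() of a str is not)
    # and its result always fits in an unsigned 32-bit integer.
    return zlib.crc32(name.encode("utf-8"))


df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"client_name": hash_name},  # applied to each cell as it is parsed
    dtype={"orders": np.uint8},             # assumes every value lies in 0..255
)

# Converter results are inferred as 64-bit integers; narrow them explicitly.
df["client_name"] = df["client_name"].astype(np.uint32)

print(df.dtypes)
```

A 32-bit hash can collide when there are many distinct client names, so a real dataset might need a wider hash or a category dtype instead.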

The following function lists NumPy's scalar types and, for the integer types, their ranges:

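import numpy as np

# np.sctypes was removed in NumPy 2.0; fall back to an explicit table
# of the common scalar types when it is not available.
sctypes = getattr(np, 'sctypes', None) or {
    'int': [np.int8, np.int16, np.int32, np.int64],
    'uint': [np.uint8, np.uint16, np.uint32, np.uint64],
    'float': [np.float16, np.float32, np.float64],
    'complex': [np.complex64, np.complex128],
    'others': [bool, object, bytes, str, np.void],
}


def list_np_types():
    for k, v in sctypes.items():
        for i, d in enumerate(v):
            if np.dtype(d).kind in 'iu':
                # only int and uint have a definite range
                fmt = '{:>7}, {:>2}: {:>26}  From: {:>20}\tTo: {}'
                print(fmt.format(k, i, str(d),
                                 str(np.iinfo(d).min),
                                 str(np.iinfo(d).max)))
            else:
                # other kinds: print the type only
                print('{:>7}, {:>2}: {:>26}'.format(k, i, str(d)))


list_np_types()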

Output:

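    int,  0:       <class 'numpy.int8'>  From:                 -128 To: 127
    int,  1:      <class 'numpy.int16'>  From:               -32768 To: 32767
    int,  2:      <class 'numpy.int32'>  From:          -2147483648 To: 2147483647
    int,  3:      <class 'numpy.int64'>  From: -9223372036854775808 To: 9223372036854775807
   uint,  0:      <class 'numpy.uint8'>  From:                    0 To: 255
   uint,  1:     <class 'numpy.uint16'>  From:                    0 To: 65535
   uint,  2:     <class 'numpy.uint32'>  From:                    0 To: 4294967295
   uint,  3:     <class 'numpy.uint64'>  From:                    0 To: 18446744073709551615
  float,  0:    <class 'numpy.float16'>
  float,  1:    <class 'numpy.float32'>
  float,  2:    <class 'numpy.float64'>
complex,  0:  <class 'numpy.complex64'>
complex,  1: <class 'numpy.complex128'>
 others,  0:             <class 'bool'>
 others,  1:           <class 'object'>
 others,  2:            <class 'bytes'>
 others,  3:              <class 'str'>
 others,  4:       <class 'numpy.void'>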
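Once the ranges are known, picking the narrowest unsigned type does not have to be done by hand. The sketch below assumes a hypothetical orders column with non-negative values and shows two existing helpers that can make the choice: pd.to_numeric with downcast="unsigned" and np.min_scalar_type.

```python
import numpy as np
import pandas as pd

# Hypothetical order counts; the real column would come from the CSV file.
orders = pd.Series([3, 250, 7, 70000])

# Let pandas pick the smallest unsigned dtype that holds every value.
small = pd.to_numeric(orders, downcast="unsigned")

# Alternatively, ask NumPy which dtype the largest value needs.
needed = np.min_scalar_type(int(orders.max()))

print(orders.dtype, "->", small.dtype, "| min_scalar_type:", needed)
print(orders.memory_usage(index=False), "bytes ->",
      small.memory_usage(index=False), "bytes")
```

Both helpers only look at the values present, so a column that may later receive larger values needs a type chosen with that headroom in mind.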
