Faster alternatives to Pandas pivot_table

隐瞒了意图╮ 2021-02-11 09:33

I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). Since execution time is paramount, I am trying to speed up the process. Currently it

3 answers
  •  小鲜肉 (original poster)
    2021-02-11 10:02

    When you read the CSV file into a DataFrame, you could pass a converter function (via the converters parameter of read_csv) to transform client_name into a hash, and downcast orders to an appropriate integer type, in particular an unsigned one.
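    A minimal sketch of that idea follows. The sample data and the helper name hash_name are made up for illustration, and zlib.crc32 is just one possible stable hash (collisions are possible with many distinct names). It also assumes the downcast is done after loading, since converter results are not stored in a compact dtype automatically.

```python
import io
import zlib

import pandas as pd

# Hypothetical sample data; only the column names come from the answer above.
csv_text = "client_name,orders\nalice,3\nbob,250\nalice,7\n"


def hash_name(name: str) -> int:
    # crc32 gives the same value on every run (unlike the built-in hash()
    # for strings) and always fits in an unsigned 32-bit integer.
    return zlib.crc32(name.encode("utf-8"))


# The converter is called once per cell of the client_name column.
df = pd.read_csv(io.StringIO(csv_text), converters={"client_name": hash_name})

# Store the hashes in 4 bytes each instead of a wider default type.
df["client_name"] = df["client_name"].astype("uint32")

# Let pandas pick the smallest unsigned type that holds every value.
df["orders"] = pd.to_numeric(df["orders"], downcast="unsigned")

print(df.dtypes)
```

    Smaller key and value columns mean less memory to move around during grouping, which is the point of the suggestion; whether it helps in practice should be measured on the real data.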

    The following function lists the NumPy scalar types and, for the integer ones, their ranges:

    import numpy as np

    # Note: np.sctypes is available in NumPy 1.x only; it was removed in NumPy 2.0.
    def list_np_types():
        # np.sctypes maps a category name ('int', 'uint', 'float', ...) to a list of types
        for k, v in np.sctypes.items():
            for i, d in enumerate(v):
                if np.dtype(d).kind in 'iu':
                    # only int and uint have a definite range
                    fmt = '{:>7}, {:>2}: {:>26}  From: {:>20}\tTo: {}'
                    print(fmt.format(k, i, str(d),
                                     str(np.iinfo(d).min),
                                     str(np.iinfo(d).max)))
                else:
                    print('{:>7}, {:>2}: {:>26}'.format(k, i, str(d)))


    list_np_types()
    

    Output:

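        int,  0:       <class 'numpy.int8'>  From:                 -128 To: 127
        int,  1:      <class 'numpy.int16'>  From:               -32768 To: 32767
        int,  2:      <class 'numpy.int32'>  From:          -2147483648 To: 2147483647
        int,  3:      <class 'numpy.int64'>  From: -9223372036854775808 To: 9223372036854775807
       uint,  0:      <class 'numpy.uint8'>  From:                    0 To: 255
       uint,  1:     <class 'numpy.uint16'>  From:                    0 To: 65535
       uint,  2:     <class 'numpy.uint32'>  From:                    0 To: 4294967295
       uint,  3:     <class 'numpy.uint64'>  From:                    0 To: 18446744073709551615
      float,  0:    <class 'numpy.float16'>
      float,  1:    <class 'numpy.float32'>
      float,  2:    <class 'numpy.float64'>
    complex,  0:  <class 'numpy.complex64'>
    complex,  1: <class 'numpy.complex128'>
     others,  0:             <class 'bool'>
     others,  1:           <class 'object'>
     others,  2:            <class 'bytes'>
     others,  3:              <class 'str'>
     others,  4:       <class 'numpy.void'>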
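
    The ranges above can be used to pick a type programmatically. This is only a sketch: the helper name smallest_uint is hypothetical, and it assumes the column holds non-negative integers whose maximum is known.

```python
import numpy as np


def smallest_uint(max_value: int) -> np.dtype:
    """Return the narrowest unsigned integer dtype that can hold max_value."""
    if max_value < 0:
        raise ValueError("unsigned types cannot hold negative values")
    # Try the unsigned types from narrowest to widest.
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise OverflowError("value does not fit in 64 bits")


print(smallest_uint(255))     # fits in one byte
print(smallest_uint(256))     # needs two bytes
print(smallest_uint(70_000))  # needs four bytes
```

    NumPy's own np.min_scalar_type does a similar job for a single value; the explicit loop is shown here only because it mirrors the table of ranges.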
    
