Faster alternatives to Pandas pivot_table


I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I am trying to speed up the process. Currently it

3 Answers

小鲜肉

2021-02-11 10:02

When you read the CSV file into a DataFrame, you can pass converter functions (via the converters parameter of read_csv) to transform client_name into a hash, and downcast orders to an appropriate integer type, preferably an unsigned one.

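This function lists the NumPy scalar types and, for the integer types, their ranges. It relies on np.sctypes, which was removed in NumPy 2.0, so it needs NumPy 1.x: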

import numpy as np

def list_np_types():
    for k, v in np.sctypes.items():
        for i, d in enumerate(v):
            if np.dtype(d).kind in 'iu':
                # only int and uint have a definite range
                fmt = '{:>7}, {:>2}: {:>26}  From: {:>20}\tTo: {}'
                print(fmt.format(k, i, str(d),
                                 str(np.iinfo(d).min),
                                 str(np.iinfo(d).max)))

            else:
                print('{:>7}, {:>2}: {:>26}'.format(k, i, str(d)))


list_np_types()

Output:

    int,  0:       <class 'numpy.int8'>  From:                 -128 To: 127
    int,  1:      <class 'numpy.int16'>  From:               -32768 To: 32767
    int,  2:      <class 'numpy.int32'>  From:          -2147483648 To: 2147483647
    int,  3:      <class 'numpy.int64'>  From: -9223372036854775808 To: 9223372036854775807
   uint,  0:      <class 'numpy.uint8'>  From:                    0 To: 255
   uint,  1:     <class 'numpy.uint16'>  From:                    0 To: 65535
   uint,  2:     <class 'numpy.uint32'>  From:                    0 To: 4294967295
   uint,  3:     <class 'numpy.uint64'>  From:                    0 To: 18446744073709551615
  float,  0:    <class 'numpy.float16'>
  float,  1:    <class 'numpy.float32'>
  float,  2:    <class 'numpy.float64'>
complex,  0:  <class 'numpy.complex64'>
complex,  1: <class 'numpy.complex128'>
 others,  0:             <class 'bool'>
 others,  1:           <class 'object'>
 others,  2:            <class 'bytes'>
 others,  3:              <class 'str'>
 others,  4:       <class 'numpy.void'>
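
The converter-and-downcast suggestion at the top of this answer could look roughly like the sketch below. The sample CSV, the hash_name helper, and the choice of zlib.crc32 as the hash are assumptions made for illustration; only the column names client_name and orders come from the answer.

```python
import io
import zlib

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_text = """client_name,orders
acme,12
globex,300
acme,7
"""


def hash_name(name: str) -> int:
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    # Assumption: a 32-bit checksum is collision-free enough for this data.
    return zlib.crc32(name.encode("utf-8"))


# converters is applied to each raw string of the column while parsing.
df = pd.read_csv(io.StringIO(csv_text), converters={"client_name": hash_name})

# crc32 values always fit in an unsigned 32-bit integer.
df["client_name"] = df["client_name"].astype(np.uint32)

# Pick the smallest unsigned integer type that can hold every value.
df["orders"] = pd.to_numeric(df["orders"], downcast="unsigned")

print(df.dtypes)
```

Downcasting after the read, rather than passing dtype to read_csv, avoids having to know the value range in advance.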


情歌与酒

2021-02-11 10:14

Convert the months and industry columns to categorical columns (see https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). This way you avoid a lot of string comparisons.
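
A minimal sketch of that conversion, assuming a small made-up frame with months, industry, and orders columns (the data and the choice of aggfunc are illustrative, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data; the real dataset has about 10 million rows.
df = pd.DataFrame({
    "months": ["Jan", "Feb", "Jan", "Mar"],
    "industry": ["retail", "tech", "tech", "retail"],
    "orders": [10, 20, 30, 40],
})

# Each distinct string is stored once; rows hold small integer codes.
for col in ["months", "industry"]:
    df[col] = df[col].astype("category")

# observed=True restricts grouping to category combinations present in the data.
table = df.pivot_table(
    index="industry",
    columns="months",
    values="orders",
    aggfunc="sum",
    fill_value=0,
    observed=True,
)
print(table)
```

Whether this speeds up the pivot noticeably depends on how many distinct values each column has, so it is worth timing on the real data.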

后悔当初

2021-02-11 10:15
You can use sparse matrices. They are quick to set up, though somewhat restricted. For example, you can't index into a coo_matrix.

I recently needed to train a recommender system (LightFM), and it accepted sparse matrices as input, which made my job a lot easier. See it in action:
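```
import numpy as np
from scipy import sparse

# Each (row[i], col[i]) pair addresses one cell; data[i] is its value.
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
mat = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
```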
```
>>> print(mat)
  (0, 0)    4
  (3, 3)    5
  (1, 1)    7
  (0, 2)    9
>>> print(mat.toarray())
[[4 0 9 0]
 [0 7 0 0]
 [0 0 0 0]
 [0 0 0 5]]
```
As you can see, it builds the pivot table directly from the row and column indices you supply and fills the remaining cells with zeros. You can also convert the sparse matrix to an array or a DataFrame (df = pd.DataFrame.sparse.from_spmatrix(mat, index=..., columns=...)).
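
To apply this to a long-format DataFrame, the row and column labels first have to be mapped to integer positions. One way to do that, sketched here under the assumption of a made-up frame with client_name, months, and orders columns, is to use pandas category codes:

```python
import pandas as pd
from scipy import sparse

# Hypothetical sample data; ("a", "Jan") appears twice on purpose.
df = pd.DataFrame({
    "client_name": ["a", "b", "a", "c", "a"],
    "months": ["Jan", "Jan", "Feb", "Mar", "Jan"],
    "orders": [4, 7, 9, 5, 1],
})

# Category codes give each distinct label an integer position.
rows = df["client_name"].astype("category")
cols = df["months"].astype("category")

mat = sparse.coo_matrix(
    (df["orders"].to_numpy(),
     (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy())),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# Converting to CSR sums duplicate (row, col) entries, which matches
# pivot_table with aggfunc="sum"; CSR also supports indexing.
csr = mat.tocsr()

pivot = pd.DataFrame.sparse.from_spmatrix(
    csr, index=rows.cat.categories, columns=cols.cat.categories
)
print(pivot)
```

This only reproduces a sum aggregation; other aggregations such as a mean would need extra work, for example a second matrix of counts.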