I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I am trying to speed up the process. Currently it
When you read the CSV file into a DataFrame, you could pass a converter function (via the read_csv parameter converters) to transform client_name into a hash, and downcast orders to an appropriate int type, in particular an unsigned one.
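As a rough sketch of that idea (not a tuned solution): the sample data, the name_to_hash helper, and the choice of uint16 for orders are all assumptions made for illustration. zlib.crc32 is used instead of the built-in hash() because it is stable between runs and fits in 32 bits; collisions are possible with a 32-bit hash, so check whether that is acceptable for your data.

```python
import io
import zlib

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_data = io.StringIO(
    "client_name,orders\n"
    "alice,3\n"
    "bob,12\n"
    "alice,7\n"
)


def name_to_hash(name: str) -> int:
    # crc32 gives the same value on every run and always fits in uint32.
    return zlib.crc32(name.encode("utf-8"))


df = pd.read_csv(
    csv_data,
    converters={"client_name": name_to_hash},
    dtype={"orders": np.uint16},  # assumption: orders never exceeds 65535
)

# The converter returns plain Python ints, so narrow the column explicitly.
df["client_name"] = df["client_name"].astype(np.uint32)

print(df.dtypes)
```

Converters run as Python code for every row, so on 10 million rows it is worth timing this against reading the column as-is and converting it afterwards.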
This function lists the types and their ranges:
    import numpy as np

    # Note: np.sctypes was removed in NumPy 2.0; this runs on NumPy 1.x.
    def list_np_types():
        for k, v in np.sctypes.items():
            for i, d in enumerate(v):
                if np.dtype(d).kind in 'iu':
                    # only int and uint have a definite range
                    fmt = '{:>7}, {:>2}: {:>26} From: {:>20}\tTo: {}'
                    print(fmt.format(k, i, str(d),
                                     str(np.iinfo(d).min),
                                     str(np.iinfo(d).max)))
                else:
                    print('{:>7}, {:>2}: {:>26}'.format(k, i, str(d)))

    list_np_types()
Output:
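        int,  0:       <class 'numpy.int8'> From:                 -128  To: 127
        int,  1:      <class 'numpy.int16'> From:               -32768  To: 32767
        int,  2:      <class 'numpy.int32'> From:          -2147483648  To: 2147483647
        int,  3:      <class 'numpy.int64'> From: -9223372036854775808  To: 9223372036854775807
       uint,  0:      <class 'numpy.uint8'> From:                    0  To: 255
       uint,  1:     <class 'numpy.uint16'> From:                    0  To: 65535
       uint,  2:     <class 'numpy.uint32'> From:                    0  To: 4294967295
       uint,  3:     <class 'numpy.uint64'> From:                    0  To: 18446744073709551615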
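      float,  0:    <class 'numpy.float16'>
      float,  1:    <class 'numpy.float32'>
      float,  2:    <class 'numpy.float64'>
    complex,  0:  <class 'numpy.complex64'>
    complex,  1: <class 'numpy.complex128'>
     others,  0:             <class 'bool'>
     others,  1:           <class 'object'>
     others,  2:            <class 'bytes'>
     others,  3:              <class 'str'>
     others,  4:       <class 'numpy.void'>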
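With those ranges in hand, picking the narrowest unsigned type for a column can be sketched as follows. The smallest_uint_dtype helper and the sample Series are hypothetical, and the helper assumes the values are non-negative. pd.to_numeric with downcast="unsigned" is shown next to it as the built-in way to get a similar result.

```python
import numpy as np
import pandas as pd


def smallest_uint_dtype(max_value: int) -> np.dtype:
    """Return the narrowest unsigned integer dtype that can hold max_value."""
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError(f"{max_value} does not fit in any unsigned integer dtype")


# Hypothetical order counts; assumed to be non-negative.
orders = pd.Series([3, 12, 7, 300])

# Manual choice based on the column's maximum value.
narrow = orders.astype(smallest_uint_dtype(int(orders.max())))

# Letting pandas pick the unsigned dtype itself.
auto = pd.to_numeric(orders, downcast="unsigned")

print(narrow.dtype, auto.dtype)
```

Both approaches inspect the data that is already loaded, so a dtype chosen this way may be too narrow for values that arrive later.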