Question
I am implementing a Genetic Algorithm. For this algorithm, a number of iterations (between 100 and 500) have to be done, where in each iteration all 100 individuals are evaluated for their 'fitness'. To this end, I have written an evaluate function. However, even for one iteration, evaluating the fitness of the 100 individuals already takes 13 seconds. I have to speed this up massively in order to implement an efficient algorithm.
The evaluate function takes two arguments and then performs some calculations. I will share part of the function, since a similar form of calculation is repeated after that. Specifically, I perform a groupby on a dataframe called df_demand, and then take the sum of a list comprehension that uses the dataframe resulting from the groupby and another dataframe called df_distance. A snippet of df_demand looks as follows, but has larger dimensions in reality (the index is just 0, 1, 2, ...):
date        customer  deliveries  warehouse
2020-10-21  A         30          1
2020-10-21  A         47          1
2020-10-21  A         59          2
2020-10-21  B         130         3
2020-10-21  B         102         3
2020-10-21  B         95          2
2020-10-22  A         55          1
2020-10-22  A         46          4
2020-10-22  A         57          4
2020-10-22  B         89          3
2020-10-22  B         104         3
2020-10-22  B         106         4
and a snippet of df_distance is (where the columns are the warehouses):
index   1      2      3      4
A       30.2   54.3   76.3   30.9
B       96.2   34.2   87.7   102.4
C       57.0   99.5   76.4   34.5
Next, I want to group df_demand such that each combination of (date, customer, warehouse) appears once and all deliveries for that combination are summed. Finally, I want to calculate the total costs. Currently, I have done the following, but it is too slow:
import math

def evaluate(df_demand, df_distance):
    costs = df_demand.groupby(["date", "customer", "warehouse"]).sum().reset_index()
    cost = sum(math.ceil(costs.iat[i, 3] / 20) * df_distance.loc[costs.iat[i, 1], costs.iat[i, 2]]
               for i in range(len(costs)))
    # etc...
    return cost
Since I have to do many iterations and considering the fact that dimensions of my data are considerably larger, my question is: what is the fastest way to do this operation?
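For reference, the slow version can be reproduced on the sample data above (the frames below are built from the snippets in the question, and the batch size of 20 comes from the original code):

```python
import math
import pandas as pd

df_demand = pd.DataFrame({
    "date": ["2020-10-21"] * 6 + ["2020-10-22"] * 6,
    "customer": ["A", "A", "A", "B", "B", "B"] * 2,
    "deliveries": [30, 47, 59, 130, 102, 95, 55, 46, 57, 89, 104, 106],
    "warehouse": [1, 1, 2, 3, 3, 2, 1, 4, 4, 3, 3, 4],
})
df_distance = pd.DataFrame(
    [[30.2, 54.3, 76.3, 30.9],
     [96.2, 34.2, 87.7, 102.4],
     [57.0, 99.5, 76.4, 34.5]],
    index=["A", "B", "C"], columns=[1, 2, 3, 4],
)

def evaluate(df_demand, df_distance):
    # one row per (date, customer, warehouse) with summed deliveries
    costs = df_demand.groupby(["date", "customer", "warehouse"]).sum().reset_index()
    # ceil(deliveries / 20) trucks, each paying the customer-to-warehouse distance
    cost = sum(math.ceil(costs.iat[i, 3] / 20) * df_distance.loc[costs.iat[i, 1], costs.iat[i, 2]]
               for i in range(len(costs)))
    return cost

print(evaluate(df_demand, df_distance))  # ≈ 3274.5
```

The per-row `.iat`/`.loc` lookups inside the Python-level loop are what dominate the runtime; the groupby itself is cheap.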
Answer 1:
Let's try:
import numpy as np

def get_cost(df, df2):
    '''
    df: deliveries data
    df2: distance data
    '''
    pivot = np.ceil(df.pivot_table(index=['customer', 'warehouse'], columns=['date'],
                                   values='deliveries', aggfunc='sum', fill_value=0)
                      .div(20))
    return pivot.mul(df2.rename_axis(index='customer', columns='warehouse').stack(),
                     axis='rows').sum().sum()
Source: https://stackoverflow.com/questions/64466451/faster-way-to-group-data-than-pandas-groupby