Question
I have the df below and want to identify any two orders that satisfy all of the following conditions:
- Distance between pickups less than X miles
- Distance between dropoffs less than Y miles
- Difference between order creation times less than Z minutes
I would use the haversine package (from haversine import haversine) to calculate the distance between pickups and the distance between dropoffs for each pair of rows (orders).
The df I currently have looks like the following:
DAY Order pickup_lat pickup_long dropoff_lat dropoff_long created_time
1/3/19 234e 32.69 -117.1 32.63 -117.08 3/1/19 19:00
1/3/19 235d 40.73 -73.98 40.73 -73.99 3/1/19 23:21
1/3/19 253w 40.76 -73.99 40.76 -73.99 3/1/19 15:26
2/3/19 231y 36.08 -94.2 36.07 -94.21 3/2/19 0:14
3/3/19 305g 36.01 -78.92 36.01 -78.95 3/2/19 0:09
3/3/19 328s 36.76 -119.83 36.74 -119.79 3/2/19 4:33
3/3/19 286n 35.76 -78.78 35.78 -78.74 3/2/19 0:43
I want my output df to contain any two orders (rows) that satisfy the above conditions. What I am not sure of is how to evaluate the conditions for every pair of rows in the dataframe and return the pairs that satisfy them.
I hope I am explaining my desired output correctly. Thanks for looking!
Answer 1:
I don't know whether this is an optimal solution, but I couldn't come up with anything better. What I have done:
- created a dataframe with all possible combinations of two orders,
- computed the required measures for every combination and added them as columns to the dataframe,
- found the indices of the rows that fulfill the mentioned conditions.
The code:
import pandas as pd
from itertools import combinations
# create a dataframe with all combinations of two orders
index_comb = list(combinations(trips.index, 2))  # trips is your dataframe
orders1 = pd.DataFrame([trips.loc[c[0], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = pd.DataFrame([trips.loc[c[1], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1, orders2], axis=1)
from haversine import haversine
def distance(row):
    # row holds lat, lon of the first order followed by lat, lon of the second
    loc_0 = (row.iloc[0], row.iloc[1])  # (lat, lon)
    loc_1 = (row.iloc[2], row.iloc[3])
    return haversine(loc_0, loc_1, unit='mi')
#pickup diff
# columns must be ordered lat, lon, lat, lon because distance() builds (lat, lon) tuples
pickup_cols = ["pickup_lat","pickup_long","pickup_lat_1","pickup_long_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)
#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)
#creation time diff, in minutes
combined["time_diff_min"] = (pd.to_datetime(combined["created_time"]) - pd.to_datetime(combined["created_time_1"])).abs().dt.total_seconds() / 60
#Thresholds
Z = 600
Y = 400
X = 400
#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X = combined["pickup_dist_mi"]<X
dropoff_dist_Y = combined["dropoff_dist_mi"]<Y
conditions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[conditions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]
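The output for your data (note that the pickup_dist_mi values shown were computed with longitude and latitude passed in swapped order):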
Order Order_1 time_diff_min dropoff_dist_mi pickup_dist_mi
(0, 5) 234e 328s 573.0 322.988195 231.300179
(1, 2) 235d 253w 475.0 2.072803 0.896893
(4, 6) 305g 286n 34.0 19.766096 10.233550
I hope I understood you correctly and that this helps.
Answer 2:
Using your dataframe as above, with the index dropped. I'm presuming your created_time column is in datetime format.
import pandas as pd
from geopy.distance import geodesic
Cross merge the dataframe to get all possible combinations of 'Order'.
df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)
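On pandas 1.2 or newer, the constant-key trick can, as far as I know, be replaced by how="cross". A minimal sketch, assuming a small stand-in frame (the three rows below are only an excerpt for illustration):

```python
import pandas as pd

# Stand-in frame for illustration only; the real df has more columns.
df = pd.DataFrame({
    "Order": ["234e", "235d", "253w"],
    "pickup_lat": [32.69, 40.73, 40.76],
})

# how="cross" builds the Cartesian product without a helper key column.
# Overlapping column names get the default _x / _y suffixes.
df_all = pd.merge(df, df, how="cross")

print(df_all.shape)           # 3 x 3 = 9 rows, 4 columns
print(list(df_all.columns))
```

The resulting column names match the Order_x / Order_y naming used in the rest of this answer, so the later steps should work unchanged.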
Remove all the rows where the two orders are the same.
df_all = df_all[df_all['Order_x'] != df_all['Order_y']].copy()
Drop duplicate rows where Order_x, Order_y == [a, b] and [b, a]
# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, and combine into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))
# drop the duplicates; the reset_index call below is not assigned back, so the original row labels are kept
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
df_all.reset_index(drop=True)
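If the Order values are unique and comparable, a single comparison could stand in for both the self-pair filter and the duplicate drop: keeping only rows where Order_x < Order_y leaves each unordered pair exactly once. A sketch on a made-up three-order frame (the uniqueness of Order is an assumption here):

```python
import pandas as pd

df = pd.DataFrame({"Order": ["234e", "235d", "253w"]})  # illustrative data
df_all = pd.merge(df.assign(key=0), df.assign(key=0), on="key").drop("key", axis=1)

# Order_x < Order_y removes (a, a) and keeps only one of (a, b) / (b, a).
pairs = df_all[df_all["Order_x"] < df_all["Order_y"]].reset_index(drop=True)
print(pairs)
```

This avoids building the helper dup_order column, at the cost of relying on the IDs being distinct.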
Create a column and calculate the time difference in minutes.
df_all['time'] = (pd.to_datetime(df_all['created_time_x']) - pd.to_datetime(df_all['created_time_y'])).abs().dt.total_seconds() / 60
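To sanity-check the minute conversion in isolation, here is a tiny sketch. It assumes created_time holds month/day/year strings as in the question. The two-row frame and the explicit format string are my assumptions, not part of the original answer.

```python
import pandas as pd

times = pd.DataFrame({
    "created_time_x": ["3/1/19 23:21", "3/2/19 0:14"],
    "created_time_y": ["3/2/19 0:14", "3/2/19 0:09"],
})

fmt = "%m/%d/%y %H:%M"  # assumed layout of created_time
t_x = pd.to_datetime(times["created_time_x"], format=fmt)
t_y = pd.to_datetime(times["created_time_y"], format=fmt)

# Absolute difference expressed as float minutes
times["time"] = (t_x - t_y).abs().dt.total_seconds() / 60
print(times["time"].tolist())  # [53.0, 5.0]
```

Taking the absolute value first means the order of the two orders within a pair does not matter.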
Create a column and calculate the distance between drop offs.
df_all['dropoff'] = df_all.apply(
    (lambda row: geodesic(
        (row['dropoff_lat_x'], row['dropoff_long_x']),
        (row['dropoff_lat_y'], row['dropoff_long_y'])  # second point uses the _y order's latitude
    ).miles),
    axis=1
)
Create a column and calculate the distance between pickups.
df_all['pickup'] = df_all.apply(
    (lambda row: geodesic(
        (row['pickup_lat_x'], row['pickup_long_x']),
        (row['pickup_lat_y'], row['pickup_long_y'])  # second point uses the _y order's latitude
    ).miles),
    axis=1
)
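Row-wise apply with geodesic can get slow once the cross product is large. If a spherical approximation is acceptable, the haversine formula can be evaluated on whole columns with NumPy. The sketch below is an alternative, not what this answer uses: haversine_miles is a hypothetical helper, the Earth radius of 3958.8 miles is an assumed mean value, and results will differ slightly from geodesic, which works on an ellipsoid.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MI = 3958.8  # assumed mean Earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or array-likes in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * np.arcsin(np.sqrt(a))

# Two illustrative pairs taken from the question's data
df_all = pd.DataFrame({
    "pickup_lat_x": [40.73, 36.01], "pickup_long_x": [-73.98, -78.92],
    "pickup_lat_y": [40.76, 35.76], "pickup_long_y": [-73.99, -78.78],
})

# One vectorized call instead of a Python-level function call per row
df_all["pickup"] = haversine_miles(
    df_all["pickup_lat_x"], df_all["pickup_long_x"],
    df_all["pickup_lat_y"], df_all["pickup_long_y"],
)
print(df_all["pickup"].round(2))
```

The same helper could be called a second time with the dropoff columns.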
Filter the results as desired.
X = 1500
Y = 2000
Z = 100
mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z
print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])
Output (these distances were produced with the _x latitude used for both points, so each value reflects only the longitude difference between the two orders):
Order_x Order_y time dropoff pickup
10 235d 231y 53.0 1059.026620 1059.026620
11 235d 305g 48.0 260.325370 259.275948
13 235d 286n 82.0 249.306279 251.929905
25 231y 305g 5.0 853.308110 854.315567
27 231y 286n 29.0 865.026077 862.126593
34 305g 286n 34.0 11.763787 7.842526
Source: https://stackoverflow.com/questions/55039498/identifying-groups-of-two-rows-that-satisfy-three-conditions-in-a-dataframe