Identifying groups of two rows that satisfy three conditions in a dataframe

浪子不回头ぞ submitted on 2019-12-11 04:27:19

Question


I have the df below and want to identify any two orders that satisfy all of the following conditions:

  1. Distance between pickups less than X miles
  2. Distance between dropoffs less than Y miles
  3. Difference between order creation times less than Z minutes

I would use the haversine package (from haversine import haversine) to calculate the distance between pickups and the distance between dropoffs for each pair of orders.

The df I currently have looks like the following:

  DAY   Order  pickup_lat  pickup_long     dropoff_lat dropoff_long  created_time
 1/3/19  234e    32.69        -117.1          32.63      -117.08   3/1/19 19:00
 1/3/19  235d    40.73        -73.98          40.73       -73.99   3/1/19 23:21
 1/3/19  253w    40.76        -73.99          40.76       -73.99   3/1/19 15:26
 2/3/19  231y    36.08        -94.2           36.07       -94.21   3/2/19 0:14
 3/3/19  305g    36.01        -78.92          36.01       -78.95   3/2/19 0:09
 3/3/19  328s    36.76        -119.83         36.74       -119.79  3/2/19 4:33
 3/3/19  286n    35.76        -78.78          35.78       -78.74   3/2/19 0:43

I want my output df to contain any two orders (rows) that satisfy the above conditions. What I am not sure of is how to compare each row against every other row in the dataframe so that only the pairs satisfying those conditions are returned.

I hope I am explaining my desired output correctly. Thanks for looking!
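
The haversine distance mentioned in the question can also be computed without an external package. The following is a minimal sketch, not the haversine package's implementation; the function name haversine_miles and the mean Earth radius of 3958.8 miles are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MI = 3958.8  # assumed mean Earth radius in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


# Pickup points of orders 235d and 253w from the sample data
pickup_gap = haversine_miles(40.73, -73.98, 40.76, -73.99)
```

Latitude comes first in every pair; swapping latitude and longitude does not raise an error here, it just yields a wrong distance.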


Answer 1:


I don't know whether this is the optimal solution, but I couldn't come up with anything better. What I did:

  • created a dataframe with all possible combinations of two orders,
  • computed the required measures for every combination and added them to the dataframe as columns,
  • found the indices of the rows that fulfill the conditions.

The code:

# create a dataframe with all combinations of two orders
import pandas as pd
from itertools import combinations

index_comb = list(combinations(trips.index, 2))  # trips is your dataframe
orders1 = pd.DataFrame([trips.loc[c[0], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = pd.DataFrame([trips.loc[c[1], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = orders2.add_suffix('_1')  # second order of each pair gets the _1 suffix
combined = pd.concat([orders1, orders2], axis=1)

from haversine import haversine

def distance(row):
    # expects four values in the order lat, lon, lat_1, lon_1
    loc_0 = (row.iloc[0], row.iloc[1])  # (lat, lon)
    loc_1 = (row.iloc[2], row.iloc[3])
    return haversine(loc_0, loc_1, unit='mi')

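#pickup diff
# columns must be ordered lat, long to match distance(); the pickup_dist_mi values
# in the sample output below were produced with long and lat swapped here
pickup_cols = ["pickup_lat","pickup_long","pickup_lat_1","pickup_long_1"]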
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)

#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)

#creation time diff
combined["time_diff_min"] = (pd.to_datetime(combined["created_time"]) - pd.to_datetime(combined["created_time_1"])).abs().dt.total_seconds() / 60

#Thresholds
Z = 600
Y = 400
X = 400

#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X =  combined["pickup_dist_mi"]<X
dropoff_dist_Y =  combined["dropoff_dist_mi"]<Y
conditions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[conditions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]

The output for your data:

        Order Order_1  time_diff_min  dropoff_dist_mi  pickup_dist_mi
(0, 5)  234e    328s          573.0       322.988195      231.300179
(1, 2)  235d    253w          475.0         2.072803        0.896893
(4, 6)  305g    286n           34.0        19.766096       10.233550

I hope I understood you correctly and that this helps.
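
The three steps above can also be sketched without building the intermediate combined dataframe. This is a plain-Python alternative, not the answer's method: the ORDERS list simply restates the sample data, and the names miles and matching_pairs as well as the 3958.8-mile Earth radius are assumptions made for illustration.

```python
from datetime import datetime
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

# (Order, pickup_lat, pickup_long, dropoff_lat, dropoff_long, created_time)
ORDERS = [
    ("234e", 32.69, -117.10, 32.63, -117.08, "3/1/19 19:00"),
    ("235d", 40.73, -73.98, 40.73, -73.99, "3/1/19 23:21"),
    ("253w", 40.76, -73.99, 40.76, -73.99, "3/1/19 15:26"),
    ("231y", 36.08, -94.20, 36.07, -94.21, "3/2/19 0:14"),
    ("305g", 36.01, -78.92, 36.01, -78.95, "3/2/19 0:09"),
    ("328s", 36.76, -119.83, 36.74, -119.79, "3/2/19 4:33"),
    ("286n", 35.76, -78.78, 35.78, -78.74, "3/2/19 0:43"),
]


def miles(lat1, lon1, lat2, lon2, radius=3958.8):
    """Haversine distance in miles; radius is an assumed mean Earth radius."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * radius * asin(sqrt(a))


def matching_pairs(orders, x_miles, y_miles, z_minutes):
    """Return (order_a, order_b) for every pair that meets all three thresholds."""
    result = []
    for a, b in combinations(orders, 2):
        # created_time is read as month/day/year, as pd.to_datetime does by default
        t_a = datetime.strptime(a[5], "%m/%d/%y %H:%M")
        t_b = datetime.strptime(b[5], "%m/%d/%y %H:%M")
        minutes = abs((t_a - t_b).total_seconds()) / 60
        if (miles(a[1], a[2], b[1], b[2]) < x_miles
                and miles(a[3], a[4], b[3], b[4]) < y_miles
                and minutes < z_minutes):
            result.append((a[0], b[0]))
    return result


pairs = matching_pairs(ORDERS, x_miles=400, y_miles=400, z_minutes=600)
```

With the thresholds used in this answer, the sketch selects the same three pairs of orders as the output above. Like the dataframe version, it compares every order with every other one, so the work grows quadratically with the number of orders.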




Answer 2:


Using your dataframe as above. Drop the index. I'm presuming your created_time column is in datetime format.

import pandas as pd
from geopy.distance import geodesic

Cross merge the dataframe to get all possible combinations of 'Order'.

df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)

Remove all the rows where the orders are equal.

df_all = df_all[df_all['Order_x'] != df_all['Order_y']].copy()

Drop duplicate rows where Order_x, Order_y == [a, b] and [b, a]

# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, and combine into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))

# drop the duplicates; the index labels from the cross merge are kept
# (assign df_all = df_all.reset_index(drop=True) if you want a fresh 0..n-1 index)
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
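
The two filtering steps above (removing self-pairs, then removing mirrored [a, b] / [b, a] pairs) can be collapsed into a single comparison. This is a sketch of an alternative, not the code that produced the output in this answer. It assumes pandas 1.2 or later for how='cross' and that the Order values are unique strings; the small df is a stand-in for the real dataframe.

```python
import pandas as pd

# Stand-in for the real dataframe: only the columns needed for the demo
df = pd.DataFrame({
    "Order": ["234e", "235d", "253w"],
    "pickup_lat": [32.69, 40.73, 40.76],
})

# Pair every row with every row (3 x 3 = 9 rows); needs pandas >= 1.2
df_all = df.merge(df, how="cross", suffixes=("_x", "_y"))

# Keeping only Order_x < Order_y drops self-pairs and one of each mirrored pair
df_all = df_all[df_all["Order_x"] < df_all["Order_y"]].reset_index(drop=True)

pair_list = list(zip(df_all["Order_x"], df_all["Order_y"]))
```

Comparing the two keys directly also avoids building the joined dup_order string.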

Create a column and calculate the time difference in minutes.

df_all['time'] = (df_all['created_time_x'] - df_all['created_time_y']).abs().dt.total_seconds() / 60

Create a column and calculate the distance between drop offs.

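# note: the second point must use dropoff_lat_y; the sample output below was
# produced with dropoff_lat_x in both points, so its distances will differ
df_all['dropoff'] = df_all.apply(
    (lambda row: geodesic(
        (row['dropoff_lat_x'], row['dropoff_long_x']),
        (row['dropoff_lat_y'], row['dropoff_long_y'])
    ).miles),
    axis=1
)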

Create a column and calculate the distance between pickups.

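# note: the second point must use pickup_lat_y; the sample output below was
# produced with pickup_lat_x in both points, so its distances will differ
df_all['pickup'] = df_all.apply(
    (lambda row: geodesic(
        (row['pickup_lat_x'], row['pickup_long_x']),
        (row['pickup_lat_y'], row['pickup_long_y'])
    ).miles),
    axis=1
)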

Filter the results as desired.

X = 1500
Y = 2000
Z = 100

mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z

print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])

   Order_x Order_y  time      dropoff       pickup
10    235d    231y  53.0  1059.026620  1059.026620
11    235d    305g  48.0   260.325370   259.275948
13    235d    286n  82.0   249.306279   251.929905
25    231y    305g   5.0   853.308110   854.315567
27    231y    286n  29.0   865.026077   862.126593
34    305g    286n  34.0    11.763787     7.842526


Source: https://stackoverflow.com/questions/55039498/identifying-groups-of-two-rows-that-satisfy-three-conditions-in-a-dataframe
