问题
I was trying to figure out if there was a way in which where I had a dataframe with multiple fields and I wanted to segment or group the dataframe into a new dataframe based on if the values of specific columns were within x amount of each other?
I.D | Created_Time | Home_Longitude | Home_Latitude | Work_Longitude | Home_Latitude
Faa1 2019-02-23 20:01:13.362 -77.0364 38.8951 -72.0364 38.8951
Above is how the original df looks with multiple rows. I want to create a new dataframe where all rows or I.Ds contain created times that are within x amount of minutes of each other, and using haversine within x miles of one another homes, and x miles within one another work.
So Basically trying to filter this dataframe into a df that only contains rows that are within x minutes of order created time, x miles within one another homes and , x miles within each work column value.
回答1:
I did this by
- calculating the distances (in miles) and time relative to the first row
- My logic
- if n rows are within x minutes/miles of the first row, then those n rows are within x minutes/miles of each other
- My logic
- filter the data using the required distance and time filter conditions
Generate some dummy data
- random co-ordinates
# Generate random Lat-Long points
def newpoint():
return uniform(-180,180), uniform(-90, 90)
home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))
df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)
# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))
# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude
0 0 2012-10-01 00:00:00 -48.885981 -39.412351 -68.756244 24.739860
1 1 2012-10-01 00:05:00 58.584893 59.851739 -119.978429 -87.687858
2 2 2012-10-01 00:10:00 -18.623484 85.435248 -14.204142 -3.693993
3 3 2012-10-01 00:15:00 -29.721788 71.671103 -69.833253 -12.446204
4 4 2012-10-01 00:20:00 168.257968 -13.247833 60.979050 -18.393925
Create Python helper function with haversine distance formula (vectorized haversine distance formula, in km)
def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
"""
slightly modified version: of http://stackoverflow.com/a/29546836/2901002
Calculate the great circle distance between two points
on the earth (specified in decimal degrees or in radians)
All (lat, lon) coordinates must have numeric dtypes and be of equal length.
"""
if to_radians:
lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
a = np.sin((lat2-lat1)/2.0)**2 + \
np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
return earth_radius * 2 * np.arcsin(np.sqrt(a))
Calculate distances (relative to first row) in km, using haversine formula. Then, convert km to miles
df['Home_dist_miles'] = \
haversine(df.Home_Longitude, df.Home_Latitude,
df.loc[0, 'Home_Longitude'], df.loc[0, 'Home_Latitude'])*0.621371
df['Work_dist_miles'] = \
haversine(df.Work_Longitude, df.Work_Latitude,
df.loc[0, 'Work_Longitude'], df.loc[0, 'Work_Latitude'])*0.621371
Calculate time differences, in minutes (relative to first row)
- for the dummy data here, the time differences will be in multiples of 5 minutes (but in real data, they could be anything)
df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60
Apply filters (method 1) and then select any 2 rows that satisfy the conditions stated in the OP
home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]
# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min
0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0
4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0
Apply filters (method 2) and then select any 2 rows that satisfy the conditions stated in the OP
multi_query = """Home_dist_miles<=12000 & \
Work_dist_miles<=8000 & \
time_min<=25"""
df_filtered = df.query(multi_query)
# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min
0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0
4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0
来源:https://stackoverflow.com/questions/55034820/segmenting-or-grouping-a-df-based-on-parameters-or-differences-within-columns-go