I have a csv containing locations (latitude, longitude) for a given user denoted by the id field, at a given time (timestamp).
Seems like you could use the magic of pandas.
It's easy to create a pandas dataframe from a csv file using the read_csv() function:
import pandas as pd
df = pd.read_csv(filename)
Based on your sample data, this will create the following dataframe:
    ID        timestamp   latitude  longitude
0    3   6/9/2017 22:20  38.795333  77.008883
1    1   5/5/2017 13:10  38.889011  77.050061
2    2  2/10/2017 16:23  40.748249  73.984191
3    1   5/5/2017 12:35  38.920602  77.222329
4    3  6/10/2017 10:00  42.366211  71.020943
5    1   5/5/2017 20:00  38.897416  77.036833
6    2   2/10/2017 7:30  38.851426  77.042298
7    3   6/9/2017 10:20  38.917346  77.222553
8    2  2/10/2017 19:51  40.782869  73.967544
9    3   6/10/2017 6:42  38.954268  77.449695
10   1   5/5/2017 16:35  38.872875  77.007763
11   2  2/10/2017 10:00  40.776931  73.876155
Pandas (and Python in general) has extensive libraries for date and time operations. But first, you will need to prepare your data by converting the timestamp column (a string) into a datetime object. I am assuming your data is in the format "MM/DD/YYYY HH:MM" (since you didn't specify).
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%Y %H:%M')
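As a quick check of the parsing step, here is a minimal sketch that reads two of the sample rows from an in-memory string instead of a file. The comma-separated layout and the use of io.StringIO are assumptions for illustration only; your actual file may be delimited differently.

```python
import io
import pandas as pd

# Two rows from the sample data; the comma-separated layout is an assumption.
csv_text = """ID,timestamp,latitude,longitude
3,6/9/2017 22:20,38.795333,77.008883
1,5/5/2017 13:10,38.889011,77.050061
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Right after read_csv the timestamp column still holds plain strings.
sample['timestamp'] = pd.to_datetime(sample['timestamp'], format='%m/%d/%Y %H:%M')

# After conversion each entry is a Timestamp, so date arithmetic works.
print(sample.loc[0, 'timestamp'])
print(sample.loc[0, 'timestamp'] - sample.loc[1, 'timestamp'])
```

If the format string does not match your data, to_datetime raises an error rather than guessing silently, which is usually what you want.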
You're going to have to define some functions to compute the distance and the velocity. The Haversine distance function is adapted from this answer.
from math import sin, cos, sqrt, atan2, radians

def getDistanceFromLatLonInKm(lat1,lon1,lat2,lon2):
    R = 6371 # Radius of the earth in km
    dLat = radians(lat2-lat1)
    dLon = radians(lon2-lon1)
    rLat1 = radians(lat1)
    rLat2 = radians(lat2)
    a = sin(dLat/2) * sin(dLat/2) + cos(rLat1) * cos(rLat2) * sin(dLon/2) * sin(dLon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = R * c # Distance in km
    return d
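
def calc_velocity(dist_km, time_start, time_end):
    """Return velocity in km/s; return 0 if time_end <= time_start to avoid dividing by 0"""
    # total_seconds() includes whole days; .seconds would silently drop them
    return dist_km / (time_end - time_start).total_seconds() if time_end > time_start else 0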
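A small sanity check may help here. The sketch below repeats the Haversine formula under a hypothetical name (haversine_km) so it runs on its own, applies it to two of user 1's sample points, and then shows how a timedelta's .seconds attribute differs from total_seconds() once the gap exceeds one day. The two timestamps in the second part are made up for illustration.

```python
from math import sin, cos, sqrt, atan2, radians
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Same Haversine formula as above, repeated so this snippet is self-contained."""
    R = 6371  # Radius of the earth in km
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    a = sin(dLat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dLon / 2) ** 2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

# Two of user 1's points from the sample data.
d = haversine_km(38.920602, 77.222329, 38.889011, 77.050061)
print(round(d, 1))  # roughly 15.3 km

# Illustrative timestamps (not from the sample data), 1 day and 1 hour apart.
gap = pd.Timestamp('2017-06-10 11:20') - pd.Timestamp('2017-06-09 10:20')
print(gap.seconds)          # 3600: only the part left over after whole days
print(gap.total_seconds())  # 90000.0: the full elapsed time
```

For the sample data every gap is under 24 hours, so both attributes agree, but the difference matters as soon as a user's records span more than a day.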
We want to compute the Haversine function on each row, but we need some information from the first row for each group. Luckily, pandas makes this easy with sort_values(), groupby() and transform().
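To see what transform() contributes, here is a toy sketch on four of the sample rows (the dataframe name toy is just for illustration). Unlike an aggregation, transform() returns one value per original row, so each group's first value gets repeated alongside every row of that group.

```python
import pandas as pd

# Four rows from the sample data, deliberately out of order.
toy = pd.DataFrame({
    'ID': [1, 2, 1, 2],
    'timestamp': pd.to_datetime(['2017-05-05 13:10', '2017-02-10 16:23',
                                 '2017-05-05 12:35', '2017-02-10 07:30']),
    'latitude': [38.889011, 40.748249, 38.920602, 38.851426],
})

# Sorting first guarantees that position 0 in each group is the earliest record.
toy = toy.sort_values(by=['ID', 'timestamp'])

# The first latitude of each ID is broadcast to every row with that ID.
toy['lat0'] = toy.groupby('ID')['latitude'].transform(lambda x: x.iat[0])
print(toy[['ID', 'latitude', 'lat0']])
```

Because the result has the same index as the input, it can be assigned straight back as a new column.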
The following code makes 3 new columns, one each for the initial latitude, longitude, and time for each ID.
# First sort by ID and timestamp:
df = df.sort_values(by=['ID', 'timestamp'])
# Group the sorted dataframe by ID, and grab the initial value for lat, lon, and time.
df['lat0'] = df.groupby('ID')['latitude'].transform(lambda x: x.iat[0])
df['lon0'] = df.groupby('ID')['longitude'].transform(lambda x: x.iat[0])
df['t0'] = df.groupby('ID')['timestamp'].transform(lambda x: x.iat[0])

# create a new column for distance
df['dist_km'] = df.apply(
    lambda row: getDistanceFromLatLonInKm(
        lat1=row['latitude'],
        lon1=row['longitude'],
        lat2=row['lat0'],
        lon2=row['lon0']
    ),
    axis=1
)

# create a new column for velocity
df['velocity_kmps'] = df.apply(
    lambda row: calc_velocity(
        dist_km=row['dist_km'],
        time_start=row['t0'],
        time_end=row['timestamp']
    ),
    axis=1
)
>>> print(df[['ID', 'timestamp', 'latitude', 'longitude', 'dist_km', 'velocity_kmps']])
    ID            timestamp   latitude  longitude     dist_km  velocity_kmps
3    1  2017-05-05 12:35:00  38.920602  77.222329    0.000000       0.000000
1    1  2017-05-05 13:10:00  38.889011  77.050061   15.314742       0.007293
10   1  2017-05-05 16:35:00  38.872875  77.007763   19.312148       0.001341
5    1  2017-05-05 20:00:00  38.897416  77.036833   16.255868       0.000609
6    2  2017-02-10 07:30:00  38.851426  77.042298    0.000000       0.000000
11   2  2017-02-10 10:00:00  40.776931  73.876155  344.880549       0.038320
2    2  2017-02-10 16:23:00  40.748249  73.984191  335.727502       0.010498
8    2  2017-02-10 19:51:00  40.782869  73.967544  339.206320       0.007629
7    3  2017-06-09 10:20:00  38.917346  77.222553    0.000000       0.000000
0    3  2017-06-09 22:20:00  38.795333  77.008883   22.942974       0.000531
9    3  2017-06-10 06:42:00  38.954268  77.449695   20.070609       0.000274
4    3  2017-06-10 10:00:00  42.366211  71.020943  648.450485       0.007611
From here, I will leave it to you to figure out how to grab the last entry for each ID.
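If you get stuck on that last step, one possible approach (a sketch, not the only way) is groupby() followed by tail(1). The small df below is a stand-in with a few rows and columns copied from the output above, so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the dataframe built above, trimmed to a few rows and columns.
df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2017-05-05 12:35', '2017-05-05 20:00',
                                 '2017-02-10 07:30', '2017-02-10 19:51']),
    'dist_km': [0.0, 16.255868, 0.0, 339.206320],
})

# Sort so the latest record of each ID comes last, then keep one row per group.
last_rows = df.sort_values(by=['ID', 'timestamp']).groupby('ID').tail(1)
print(last_rows)
```

tail(1) keeps the original index and all columns, which is handy if you want to look the rows up in the full dataframe afterwards.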