How to sum distances between data points in a dataset using (Py)Spark?

悲&欢浪女 2020-12-22 08:36

I have a dataset of user locations in lat/lon format, recorded over a period of time. I would like to calculate the distance each user traveled. Sample dataset:

1 answer
  • 2020-12-22 09:11

    This looks like a job for window functions. Assuming we define distance as the great-circle distance in kilometers:

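    # `radians` replaces `toRadians`, which was deprecated in Spark 2.1 and removed in 3.0
    from pyspark.sql.functions import acos, cos, sin, lit, radians
    
    def dist(long_x, lat_x, long_y, lat_y):
        # Spherical law of cosines; inputs in degrees, Earth radius 6371 km
        return acos(
            sin(radians(lat_x)) * sin(radians(lat_y)) + 
            cos(radians(lat_x)) * cos(radians(lat_y)) * 
                cos(radians(long_x) - radians(long_y))
        ) * lit(6371.0)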
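
    To sanity-check the formula outside Spark, here is a minimal plain-Python sketch of the same expression. The name `dist_km` and the clamping step are additions for illustration, not part of the answer above. The clamp is there because floating-point rounding can push the argument of `acos` slightly outside [-1, 1] for identical or nearly identical points; `math.acos` then raises `ValueError`, and Spark's `acos` returns NaN.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as in the Spark version


def dist_km(long_x, lat_x, long_y, lat_y):
    """Great-circle distance in km between two points given in degrees."""
    cos_angle = (
        math.sin(math.radians(lat_x)) * math.sin(math.radians(lat_y))
        + math.cos(math.radians(lat_x)) * math.cos(math.radians(lat_y))
        * math.cos(math.radians(long_x) - math.radians(long_y))
    )
    # Rounding can push the value slightly outside [-1, 1]; clamp before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM


# One degree of longitude along the equator
print(dist_km(0.0, 0.0, 1.0, 0.0))
```

    One degree of longitude along the equator comes out at about 111.19 km, which is a quick way to confirm the argument order (longitude first, then latitude).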
    

    you can define window as:

    from pyspark.sql.window import Window
    
    w = Window.partitionBy("User").orderBy("Timestamp")
    

    and compute distances between consecutive observations using lag:

    from pyspark.sql.functions import lag
    
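    # DataFrames are immutable, so keep the result of withColumn.
    # The first row of each user has no previous row, so its dist is null.
    df = df.withColumn("dist", dist(
        "longitude", "latitude",
        lag("longitude", 1).over(w), lag("latitude", 1).over(w)
    ))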
    

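    After that you can perform a standard aggregation, for example grouping by User and summing the dist column. The first observation of each user has no predecessor, so its dist is null, and sum ignores nulls.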
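
    To cross-check the Spark result on a small sample collected to the driver, the same partition / order / lag / sum logic can be mirrored in plain Python. This is only a sketch: `total_distance_per_user`, the tuple layout of `rows`, and the pluggable `dist` callable are assumptions made for illustration.

```python
from collections import defaultdict


def total_distance_per_user(rows, dist):
    """Sum distances between consecutive observations of each user.

    rows: iterable of (user, timestamp, longitude, latitude) tuples.
    dist: callable (long_x, lat_x, long_y, lat_y) -> distance.
    """
    by_user = defaultdict(list)
    for user, ts, lon, lat in rows:           # like partitionBy("User")
        by_user[user].append((ts, lon, lat))

    totals = {}
    for user, points in by_user.items():
        points.sort(key=lambda p: p[0])       # like orderBy("Timestamp")
        # Pair every point with its predecessor, like lag(..., 1)
        totals[user] = sum(
            dist(lon, lat, prev_lon, prev_lat)
            for (_, prev_lon, prev_lat), (_, lon, lat) in zip(points, points[1:])
        )
    return totals
```

    Note one difference from Spark: a user with a single observation gets a total of 0 here, whereas summing a column that contains only nulls yields null in Spark.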
