Question
I am using PySpark, so I have tried both PySpark code and SQL.
For each row, I am trying to get the last time the ADDRESS column held a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:
+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
| 1| 1| A| 10|
| 2| 1| B| 15|
| 3| 1| A| 20|
| 4| 1| A| 40|
| 5| 1| A| 45|
+---+-------+-------+----+
The new column I would like, LAST_DIFF, is shown below:
+---+-------+-------+----+---------+
| ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
+---+-------+-------+----+---------+
| 1| 1| A| 10| null|
| 2| 1| B| 15| 10|
| 3| 1| A| 20| 15|
| 4| 1| A| 40| 15|
| 5| 1| A| 45| 15|
+---+-------+-------+----+---------+
I have tried using different windows but none ever seem to get exactly what I want. Any ideas?
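For reference, the sample data above can be recreated with something like the following (a minimal sketch, assuming a SparkSession is already available as spark):
# Sketch: build the sample DataFrame from the table above
df = spark.createDataFrame(
    [(1, 1, 'A', 10),
     (2, 1, 'B', 15),
     (3, 1, 'A', 20),
     (4, 1, 'A', 40),
     (5, 1, 'A', 45)],
    ['ID', 'USER_ID', 'ADDRESS', 'TIME'])
df.show()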
Answer 1:
A simplified version of @jxc's answer.
# Note: sum, min, lag and when here are pyspark.sql.functions, not Python built-ins
from pyspark.sql.functions import *
from pyspark.sql import Window

# Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))

# Get the previous time and classify rows into groups: the running sum of the 0/1
# flag increases by one every time the address differs from the previous row's
grp_df = df.withColumn('grp', sum(when(lag(col('address')).over(w) == col('address'), 0).otherwise(1)).over(w)) \
           .withColumn('prev_time', lag(col('time')).over(w))

# Window definition restricted to those groups
w_grp = Window.partitionBy(col('user_id'), col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time', min(col('prev_time')).over(w_grp)).show()
- Use lag with a running sum to assign groups whenever there is a change in the column value (based on the defined window). Get the time from the previous row, which will be used in the next step.
- Once you get the groups, use the running minimum to get the last timestamp of the column value change. (Suggest you look at the intermediate results to understand the transformations better.)
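Since the question mentions trying SQL as well, here is a rough Spark SQL version of the same idea, run through spark.sql (a sketch only; the temporary view name events is an assumption, not something from the original post):
# Sketch: same approach in Spark SQL; the view name "events" is assumed
df.createOrReplaceTempView('events')
spark.sql("""
    WITH lagged AS (          -- previous address and time within each user
        SELECT *,
               LAG(ADDRESS) OVER (PARTITION BY USER_ID ORDER BY ID) AS prev_address,
               LAG(TIME)    OVER (PARTITION BY USER_ID ORDER BY ID) AS prev_time
        FROM events
    ),
    grouped AS (              -- running sum starts a new group at every address change
        SELECT *,
               SUM(CASE WHEN ADDRESS = prev_address THEN 0 ELSE 1 END)
                   OVER (PARTITION BY USER_ID ORDER BY ID) AS grp
        FROM lagged
    )
    SELECT ID, USER_ID, ADDRESS, TIME,
           MIN(prev_time) OVER (PARTITION BY USER_ID, grp ORDER BY ID) AS LAST_DIFF
    FROM grouped
    ORDER BY ID
""").show()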
Answer 2:
One way using two Window specs:
from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window
w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')
# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))
# group by USER_ID and the above sub-group label, and calculate the sum of TIME in each group as diff
# calculate last_diff with lag over the groups, then join the data back to df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))
df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+
|USER_ID| g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
| 1| 1| 1| A| 10| 10| null|
| 1| 2| 2| B| 15| 15| 10|
| 1| 3| 3| A| 20| 105| 15|
| 1| 3| 4| A| 40| 105| 15|
| 1| 3| 5| A| 45| 105| 15|
+-------+---+---+-------+----+----+---------+
df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')
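Because the join does not guarantee row order, one last step worth adding is an explicit sort. The following usage sketch just reorders the columns and rows so they line up with the layout requested in the question:
# Usage sketch: restore the original column order and sort by ID
df_new.select('ID', 'USER_ID', 'ADDRESS', 'TIME', 'last_diff').orderBy('ID').show()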
Source: https://stackoverflow.com/questions/59198003/sql-or-pyspark-get-the-last-time-a-column-had-a-different-value-for-each-id