Filter pyspark dataframe based on time difference between two columns


Question


I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType. I would like to filter this dataframe to rows where the time difference between these two columns is less than one hour.

I'm currently trying to do this like so:

examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1))

But this fails with the following error message:

org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '(`tstamp` - `date`)' requires (numeric or calendarinterval) type, not timestamp

What is the correct method to achieve this filter?


Answer 1:


Your columns have different types, so Spark cannot tell what their difference should mean: for timestamps a difference is usually measured in seconds, for dates in days (and, as the error says, subtraction here is only defined for numeric or interval types anyway). You can convert both columns to unix timestamps beforehand to get a difference in seconds:

import pyspark.sql.functions as psf

# convert both columns to unix timestamps (seconds since the epoch) and
# keep rows where they are less than 3600 seconds (one hour) apart
data.filter(
    psf.abs(psf.unix_timestamp(data.tstamp) - psf.unix_timestamp(data.date)) < 3600
)
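
If you prefer SQL expression strings, the same filter can be written as one; this is a minimal sketch I'm adding (not part of the original answer), using the same column names as above:

# DataFrame.filter also accepts a SQL expression string
data.filter("abs(unix_timestamp(tstamp) - unix_timestamp(date)) < 3600")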

EDIT

This function works on strings (provided they are in the correct format), on timestamps, and on dates:

import datetime

# hc and sc are assumed to be a HiveContext and SparkContext (pre-Spark 2.0 entry points)
data = hc.createDataFrame(sc.parallelize([[datetime.datetime(2017,1,2,1,1,1), datetime.date(2017,8,7)]]), ['tstamp', 'date'])
data.printSchema()
    root
     |-- tstamp: timestamp (nullable = true)
     |-- date: date (nullable = true)

data.select(
    psf.unix_timestamp(data.tstamp).alias('tstamp'), psf.unix_timestamp(data.date).alias("date")
).show()
    +----------+----------+
    |    tstamp|      date|
    +----------+----------+
    |1483315261|1502056800|
    +----------+----------+
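
For reference, here is a self-contained sketch of the whole filter using the Spark 2.x+ SparkSession entry point; the sample rows and session setup are illustrative additions, not part of the original answer:

from datetime import datetime, date
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()

# illustrative sample data: the first row is 30 minutes from midnight,
# the second is more than a day away
data = spark.createDataFrame(
    [(datetime(2017, 8, 7, 0, 30, 0), date(2017, 8, 7)),
     (datetime(2017, 8, 8, 12, 0, 0), date(2017, 8, 7))],
    ['tstamp', 'date'])

# keep only rows where the two columns are less than one hour apart
examples = data.filter(
    psf.abs(psf.unix_timestamp(data.tstamp) - psf.unix_timestamp(data.date)) < 3600
)
examples.show()  # only the first row should remain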


Source: https://stackoverflow.com/questions/45849311/filter-pyspark-dataframe-based-on-time-difference-between-two-columns
