Question
I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType. I would like to filter this dataframe to the rows where the time difference between these two columns is less than one hour.
I'm currently trying to do this like so:
examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1))
But this fails with the following error message:
org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '(`tstamp` - `date`)' requires (numeric or calendarinterval) type, not timestamp
What is the correct method to achieve this filter?
Answer 1:
Since your columns have different types, it's difficult to interpret what a difference between them would mean: for timestamps it's usually in seconds, and for dates in days. You can convert both columns to Unix timestamps beforehand to get the difference in seconds:
import pyspark.sql.functions as psf

# keep only the rows where the two columns are less than 3600 seconds (1 hour) apart
data.filter(
    psf.abs(psf.unix_timestamp(data.tstamp) - psf.unix_timestamp(data.date)) < 3600
)
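For context, here is a minimal, self-contained sketch of the same filter end to end; the SparkSession setup and the sample rows are assumptions added for illustration, not part of the original question:

import datetime
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# assumed sample data: the first pair is 90 minutes apart, the second pair 20 minutes apart
data = spark.createDataFrame(
    [(datetime.datetime(2017, 1, 2, 1, 30, 0), datetime.date(2017, 1, 2)),
     (datetime.datetime(2017, 1, 2, 0, 20, 0), datetime.date(2017, 1, 2))],
    ['tstamp', 'date'])

# only the second row survives: its difference is 1200 seconds, which is below 3600
examples = data.filter(
    psf.abs(psf.unix_timestamp(data.tstamp) - psf.unix_timestamp(data.date)) < 3600
)
examples.show()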
EDIT
This function works on strings (provided they are in the correct format), on timestamps, and on dates:
import datetime

data = hc.createDataFrame(
    sc.parallelize([[datetime.datetime(2017, 1, 2, 1, 1, 1), datetime.date(2017, 8, 7)]]),
    ['tstamp', 'date'])
data.printSchema()
root
|-- tstamp: timestamp (nullable = true)
|-- date: date (nullable = true)
data.select(
    psf.unix_timestamp(data.tstamp).alias('tstamp'),
    psf.unix_timestamp(data.date).alias('date')
).show()
+----------+----------+
| tstamp| date|
+----------+----------+
|1483315261|1502056800|
+----------+----------+
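If the columns really are a timestamp and a date (as in the schema above), casting to epoch seconds is a possible alternative. This is only a sketch and not part of the original answer; the date column goes through a timestamp cast first, since a direct date-to-long cast is not allowed:

# alternative sketch (an assumption, not from the original answer):
# cast both columns to epoch seconds and compare the difference directly
diff_seconds = psf.abs(
    data.tstamp.cast('long') - data.date.cast('timestamp').cast('long')
)
data.filter(diff_seconds < 3600).show()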
Source: https://stackoverflow.com/questions/45849311/filter-pyspark-dataframe-based-on-time-difference-between-two-columns