Question:

I have a dataframe df loaded from a Hive table. It has a timestamp column, say ts, stored as a string in the format dd-MMM-yy hh.mm.ss.MS a (in Python datetime notation, this is %d-%b-%y %I.%M.%S.%f %p).

Now I want to filter rows from the dataframe that are from the last five minutes:

    only_last_5_minutes = df.filter(
        datetime.strptime(df.ts, '%d-%b-%y %I.%M.%S.%f %p') > datetime.now() - timedelta(minutes=5)
    )

However, this does not work, and I get this message:

TypeError: strptime() argument 1 must be string, not Column

It looks like I am applying the column operation incorrectly. It seems to me that I have to create a lambda function to filter each column that satisfies the desired condition, but being a newbie to Python and to lambda expressions in particular, I don't know how to write the filter correctly. Please advise.

P.S. I prefer to express my filters in native Python (or SparkSQL) rather than as a filter inside the 'WHERE' clause of a Hive SQL query.

Preferred:

    df = sqlContext.sql("SELECT * FROM my_table")
    df.filter(...)  # filter here

Not preferred:

    df = sqlContext.sql("SELECT * FROM my_table WHERE...")

Answer 1:

It is possible to use a user-defined function (UDF).

    from datetime import datetime, timedelta
    from pyspark.sql.types import BooleanType, TimestampType
    from pyspark.sql.functions import udf, col

    def in_last_5_minutes(now):
        def _in_last_5_minutes(then):
            then_parsed = datetime.strptime(then, '%d-%b-%y %I.%M.%S.%f %p')
            return then_parsed > now - timedelta(minutes=5)
        return udf(_in_last_5_minutes, BooleanType())

Using some dummy data:

    df = sqlContext.createDataFrame([
        (1, '14-Jul-15 11.34.29.000000 AM'),
        (2, '14-Jul-15 11.34.27.000000 AM'),
        (3, '14-Jul-15 11.32.11.000000 AM'),
        (4, '14-Jul-15 11.29.00.000000 AM'),
        (5, '14-Jul-15 11.28.29.000000 AM')
    ], ('id', 'datetime'))

    now = datetime(2015, 7, 14, 11, 35)
    df.where(in_last_5_minutes(now)(col("datetime"))).show()

And as expected we get only 3 entries:

    +--+--------------------+
    |id|            datetime|
    +--+--------------------+
    | 1|14-Jul-15 11.34.2...|
    | 2|14-Jul-15 11.34.2...|
    | 3|14-Jul-15 11.32.1...|
    +--+--------------------+
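The function wrapped by the UDF is ordinary Python, so the predicate can be checked without a Spark session. The following is a minimal sketch on the same dummy data. The helper name `is_recent` and the constant `FMT` are made up for this illustration, and `%b` is assumed to run under an English (or C) locale so that 'Jul' is recognized.

```python
from datetime import datetime, timedelta

# Same strptime format string as in the UDF above.
FMT = '%d-%b-%y %I.%M.%S.%f %p'

def is_recent(then, now):
    """Hypothetical helper: True if the string `then` is less than 5 minutes before `now`."""
    return datetime.strptime(then, FMT) > now - timedelta(minutes=5)

rows = [
    (1, '14-Jul-15 11.34.29.000000 AM'),
    (2, '14-Jul-15 11.34.27.000000 AM'),
    (3, '14-Jul-15 11.32.11.000000 AM'),
    (4, '14-Jul-15 11.29.00.000000 AM'),
    (5, '14-Jul-15 11.28.29.000000 AM'),
]

now = datetime(2015, 7, 14, 11, 35)

# Keep only the ids whose timestamp falls after 11:30:00.
recent_ids = [row_id for row_id, ts in rows if is_recent(ts, now)]
print(recent_ids)
```

Rows 4 and 5 fall before the 11:30:00 cutoff, which matches the three rows Spark returns.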

Parsing the datetime string again for every filter is rather inefficient, so you may consider storing the value as a TimestampType instead.

    def parse_dt():
        def _parse(dt):
            return datetime.strptime(dt, '%d-%b-%y %I.%M.%S.%f %p')
        return udf(_parse, TimestampType())

    df_with_timestamp = df.withColumn("timestamp", parse_dt()(df.datetime))

    def in_last_5_minutes(now):
        def _in_last_5_minutes(then):
            return then > now - timedelta(minutes=5)
        return udf(_in_last_5_minutes, BooleanType())

    df_with_timestamp.where(in_last_5_minutes(now)(col("timestamp")))

The result:

    +--+--------------------+--------------------+
    |id|            datetime|           timestamp|
    +--+--------------------+--------------------+
    | 1|14-Jul-15 11.34.2...|2015-07-14 11:34:...|
    | 2|14-Jul-15 11.34.2...|2015-07-14 11:34:...|
    | 3|14-Jul-15 11.32.1...|2015-07-14 11:32:...|
    +--+--------------------+--------------------+

Finally, it is possible to use a raw SQL query with timestamps:

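    import time

    df.registerTempTable("df")  # make the DataFrame visible to SQL under the name "df"

    # hh (1-12) is the hour field that pairs with the AM/PM marker `a`
    query = """SELECT * FROM df
        WHERE unix_timestamp(datetime, 'dd-MMM-yy hh.mm.ss.SSSSSS a') > {0}
        """.format(time.mktime((now - timedelta(minutes=5)).timetuple()))

    sqlContext.sql(query)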

As above, it would be more efficient to parse the date strings once.
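The numeric cutoff interpolated into the query can be inspected in plain Python. This sketch only covers the Python side. It assumes that the Python process and the Spark session use the same time zone, since `time.mktime` interprets a naive datetime as local time. The variable names `cutoff` and `round_trip` are introduced here for illustration.

```python
import time
from datetime import datetime, timedelta

now = datetime(2015, 7, 14, 11, 35)

# Seconds since the epoch for "five minutes before now", read as local time.
cutoff = time.mktime((now - timedelta(minutes=5)).timetuple())

# Converting back with the same local time zone recovers the original wall-clock time.
round_trip = datetime.fromtimestamp(cutoff)

print(cutoff)
print(round_trip)
```

The printed epoch value depends on the machine's time zone, so it will differ between environments even though the round trip always lands on 11:30:00 local time.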

If the column is already a timestamp, it is possible to use datetime literals:

    from pyspark.sql.functions import lit

    df_with_timestamp.where(
        df_with_timestamp.timestamp > lit(now - timedelta(minutes=5)))

EDIT

Since Spark 1.5 you can parse the date string as follows:

    from pyspark.sql.functions import from_unixtime, unix_timestamp
    from pyspark.sql.types import TimestampType

    # The pattern fields follow the data: day, month abbreviation, two-digit year.
    df.select((from_unixtime(unix_timestamp(
        df.datetime, "dd-MMM-yy h.mm.ss.SSSSSS aa"
    ))).cast(TimestampType()).alias("datetime"))
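Because both the day and the year are two digits in this data, a pattern with those fields in the wrong order parses without error but yields a different date. Here is a plain-Python illustration using `strptime`. That Spark's Java-style patterns behave analogously is an assumption here, not something checked against every Spark version. The variable names are made up for the example.

```python
from datetime import datetime

raw = '14-Jul-15 11.34.29.000000 AM'

# Day first, year last: matches how the sample data is written.
day_first = datetime.strptime(raw, '%d-%b-%y %I.%M.%S.%f %p')

# Year first, day last: also parses, but swaps the meaning of 14 and 15.
year_first = datetime.strptime(raw, '%y-%b-%d %I.%M.%S.%f %p')

print(day_first.date())   # 2015-07-14
print(year_first.date())  # 2014-07-15
```

Comparing a few parsed values against known rows is a cheap way to catch this kind of silent mismatch.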
