Column filtering in PySpark

时光取名叫无心 2021-01-31 06:39

I have a dataframe df loaded from a Hive table. It has a timestamp column, say ts, stored as a string in the format dd-MMM-yy hh.mm.ss.MS a. How can I filter the dataframe to keep only the rows from the last 5 minutes?

2 Answers
  •  悲&欢浪女
    2021-01-31 07:27

    It is possible to use a user defined function:

    from datetime import datetime, timedelta
    from pyspark.sql.types import BooleanType, TimestampType
    from pyspark.sql.functions import udf, col
    
    def in_last_5_minutes(now):
        def _in_last_5_minutes(then):
            then_parsed = datetime.strptime(then, '%d-%b-%y %I.%M.%S.%f %p')
            return then_parsed > now - timedelta(minutes=5)
        return udf(_in_last_5_minutes, BooleanType())
    

    Using some dummy data:

    df = sqlContext.createDataFrame([
        (1, '14-Jul-15 11.34.29.000000 AM'),
        (2, '14-Jul-15 11.34.27.000000 AM'),
        (3, '14-Jul-15 11.32.11.000000 AM'),
        (4, '14-Jul-15 11.29.00.000000 AM'),
        (5, '14-Jul-15 11.28.29.000000 AM')
    ], ('id', 'datetime'))
    
    now = datetime(2015, 7, 14, 11, 35)
    df.where(in_last_5_minutes(now)(col("datetime"))).show()
    

    And as expected we get only 3 entries:

    +--+--------------------+
    |id|            datetime|
    +--+--------------------+
    | 1|14-Jul-15 11.34.2...|
    | 2|14-Jul-15 11.34.2...|
    | 3|14-Jul-15 11.32.1...|
    +--+--------------------+
    

    Parsing the datetime string over and over is rather inefficient, so you may consider storing a TimestampType instead:

    def parse_dt():
        def _parse(dt):
            return datetime.strptime(dt, '%d-%b-%y %I.%M.%S.%f %p')
        return udf(_parse, TimestampType())
    
    df_with_timestamp = df.withColumn("timestamp", parse_dt()(df.datetime))
    
    def in_last_5_minutes(now):
        def _in_last_5_minutes(then):
            return then > now - timedelta(minutes=5)
        return udf(_in_last_5_minutes, BooleanType())
    
    df_with_timestamp.where(in_last_5_minutes(now)(col("timestamp")))
    

    and result:

    +--+--------------------+--------------------+
    |id|            datetime|           timestamp|
    +--+--------------------+--------------------+
    | 1|14-Jul-15 11.34.2...|2015-07-14 11:34:...|
    | 2|14-Jul-15 11.34.2...|2015-07-14 11:34:...|
    | 3|14-Jul-15 11.32.1...|2015-07-14 11:32:...|
    +--+--------------------+--------------------+
    

    Finally, it is possible to use a raw SQL query with timestamps:

    query = """SELECT * FROM df
         WHERE unix_timestamp(datetime, 'dd-MMM-yy HH.mm.ss.SSSSSS a') > {0}
         """.format(time.mktime((now - timedelta(minutes=5)).timetuple()))
    
    sqlContext.sql(query)
    

    As above, it would be more efficient to parse the date strings only once.
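
    For example, a minimal sketch (assuming the df_with_timestamp dataframe from above, registered here under a hypothetical temporary table name df_parsed):

    # Sketch: query the pre-parsed timestamp column so the raw string
    # is not re-parsed on every filter; df_parsed is a hypothetical name.
    df_with_timestamp.registerTempTable("df_parsed")

    cutoff = now - timedelta(minutes=5)
    query = """SELECT * FROM df_parsed
         WHERE `timestamp` > CAST('{0}' AS TIMESTAMP)""".format(cutoff)

    sqlContext.sql(query).show()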

    If the column is already a timestamp, it is possible to use datetime literals:

    from pyspark.sql.functions import lit
    
    df_with_timestamp.where(
        df_with_timestamp.timestamp > lit(now - timedelta(minutes=5)))
    

    EDIT

    Since Spark 1.5 you can parse the date string as follows:

    from pyspark.sql.functions import from_unixtime, unix_timestamp
    from pyspark.sql.types import TimestampType
    
    df.select((from_unixtime(unix_timestamp(
        df.datetime, "dd-MMM-yy hh.mm.ss.SSSSSS a"
    ))).cast(TimestampType()).alias("datetime"))
    
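    A sketch of the same last-5-minutes filter built on these functions (assuming the now cutoff defined earlier; on a live cluster current_timestamp() would be the natural reference point instead):

    from pyspark.sql.functions import from_unixtime, unix_timestamp, lit
    from pyspark.sql.types import TimestampType
    from datetime import timedelta

    # parse the string column once, then compare against a datetime literal
    parsed = from_unixtime(unix_timestamp(
        df.datetime, "dd-MMM-yy hh.mm.ss.SSSSSS a"
    )).cast(TimestampType())

    df.where(parsed > lit(now - timedelta(minutes=5))).show()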
