PySpark 1.5: How to Truncate a Timestamp to the Nearest Minute (Dropping Seconds)

Asked by 予麋鹿, 2021-02-07 21:11

I am using PySpark. I have a column ('dt') in a DataFrame ('canon_evt') that is a timestamp. I am trying to remove the seconds from a DateTime value. It is originally read in…

4 Answers
Answered by 情书的邮戳, 2021-02-07 21:51

    I think zero323 has the best answer. It's kind of annoying that Spark doesn't support this natively, given how easy it is to implement. The core trick, sketched below against the question's canon_evt DataFrame and its 'dt' column, is to convert to Unix seconds, round to the nearest minute, and cast back to a timestamp:
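
    import pyspark.sql.functions as func

    # Assuming canon_evt is the question's DataFrame: overwrite 'dt', snapped to the minute.
    canon_evt = canon_evt.withColumn(
        'dt', (func.round(func.unix_timestamp('dt') / 60) * 60).cast('timestamp'))

    For posterity, here is a generalized function that I use: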

    def trunc(date, format):
        """Wraps Spark's trunc function to support day, minute, and hour."""
        import re
        import pyspark.sql.functions as func

        # Ghetto hack to get the column name from a Column object or a string:
        try:
            colname = re.match(r"Column<.?'(.*)'>", str(date)).groups()[0]
        except AttributeError:
            colname = date

        alias = "trunc(%s, %s)" % (colname, format)

        if format in ('year', 'YYYY', 'yy', 'month', 'mon', 'mm'):
            return func.trunc(date, format).alias(alias)
        elif format in ('day', 'DD'):
            # date_sub(date, 0) is a no-op that yields a date, dropping the time part.
            return func.date_sub(date, 0).alias(alias)
        elif format in ('min',):
            # round() snaps to the *nearest* minute; swap in floor() to always round down.
            return (func.round(func.unix_timestamp(date) / 60) * 60).cast("timestamp").alias(alias)
        elif format in ('hour',):
            return (func.round(func.unix_timestamp(date) / 3600) * 3600).cast("timestamp").alias(alias)
        else:
            raise ValueError("Unsupported format: %r" % format)
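
    Hypothetical usage on the question's DataFrame, snapping 'dt' to the nearest minute:

    canon_evt = canon_evt.withColumn('dt', trunc(canon_evt['dt'], 'min'))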
