Aggregation with Group By date in Spark SQL

时光说笑 2021-01-03 04:43

I have an RDD containing a timestamp named time of type long:

root
 |-- id: string (nullable = true)
 |-- value1: string (nullable = true)
          


        
3 answers
  • 2021-01-03 05:34

    I'm using Spark 1.4.0, and DATE appears to have been part of the Spark SQL API since 1.2.0 (SPARK-2562). DATE should let you group by the time as YYYY-MM-DD.

    I also have a similar data structure, where my created_on is analogous to your time field.

    root
    |-- id: long (nullable = true)
    |-- value1: long (nullable = true)
    |-- created_on: long (nullable = true)
    

    I solved it using FROM_UNIXTIME(created_on,'yyyy-MM-dd'), which works well:

    val countQuery = "SELECT FROM_UNIXTIME(created_on,'yyyy-MM-dd') as `date_created`, COUNT(*) AS `count` FROM user GROUP BY FROM_UNIXTIME(created_on,'yyyy-MM-dd')"
    

    From here you can do the usual operations: execute the query into a DataFrame and so on.

    FROM_UNIXTIME probably worked because my Spark installation includes Hive, and it is a Hive UDF. However, it is slated to become part of the native Spark SQL syntax in a future release (SPARK-8175).
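
    A note on the pattern letters, as a minimal sketch: Hive documents FROM_UNIXTIME patterns in terms of Java's SimpleDateFormat, where lowercase yyyy is the calendar year and uppercase YYYY is the week-based year. The two can disagree in the days around New Year, so a date pattern normally wants yyyy. The example below runs outside Spark; the object name, the helper, and the sample timestamp are illustrative assumptions, and it pins Locale.US and UTC because week-year rules depend on the locale.

    ```scala
    import java.text.SimpleDateFormat
    import java.util.{Date, Locale, TimeZone}

    // Hypothetical demo object, not part of the original answer.
    object YearPatternDemo {
      // Formats epoch seconds with a SimpleDateFormat pattern, in UTC with US week rules.
      def format(pattern: String, epochSeconds: Long): String = {
        val sdf = new SimpleDateFormat(pattern, Locale.US)
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"))
        sdf.format(new Date(epochSeconds * 1000L))
      }

      def main(args: Array[String]): Unit = {
        val ts = 1577707200L // 2019-12-30 12:00:00 UTC (sample value)
        println(format("yyyy-MM-dd", ts)) // 2019-12-30 (calendar year)
        println(format("YYYY-MM-dd", ts)) // 2020-12-30 (week-based year)
      }
    }
    ```

    With the uppercase pattern, rows from the last days of December could end up grouped under the wrong year.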

  • 2021-01-03 05:36

    Not sure if this is what you need, but I have struggled with dates and timestamps in Spark SQL too. The only approach I came up with was casting a string to timestamp, since I could not get a Date type to work in Spark SQL.

    Anyway, here is my code for something similar to what you need (with a Long in place of a String):

    val mySQL = sqlContext.sql(
      "select cast(yourLong as timestamp) as time_cast, count(1) total " +
      "from logs " +
      "group by cast(yourLong as timestamp)"
    )
    val result = mySQL.map(x => (x(0).toString, x(1).toString))
    

    and the output is something like this:

    (2009-12-18 10:09:28.0,7)
    (2009-12-18 05:55:14.0,1)
    (2009-12-18 16:02:50.0,2)
    (2009-12-18 09:32:32.0,2)
    

    Could this be useful for you as well, even though I'm using a timestamp rather than a Date?

    Hope it helps.

    FF

    EDIT: to test a single cast from Long to Timestamp, I tried this simple change:

    val mySQL = sqlContext.sql(
      "select cast(1430838439 as timestamp) as time_cast, count(1) total " +
      "from logs " +
      "group by cast(1430838439 as timestamp)"
    )
    val result = mySQL.map(x => (x(0), x(1)))
    

    and it all worked fine, with this result:

    (1970-01-17 14:27:18.439,4)  // 4 because I have 4 rows in my table
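
    The 1970 date in that result suggests the literal was read as milliseconds since the epoch rather than seconds. As a minimal sketch outside Spark (the object and method names are illustrative assumptions), the same number can be interpreted both ways with java.time. The one-hour difference from the output above is presumably the local time zone of the machine that produced it, which is a guess.

    ```scala
    import java.time.Instant

    // Hypothetical demo object, not part of the original answer.
    object EpochUnits {
      // Interpret the raw number as milliseconds since 1970-01-01T00:00:00Z.
      def asMillis(raw: Long): Instant = Instant.ofEpochMilli(raw)

      // Interpret the same raw number as seconds since the epoch.
      def asSeconds(raw: Long): Instant = Instant.ofEpochSecond(raw)

      def main(args: Array[String]): Unit = {
        val raw = 1430838439L
        println(asMillis(raw))  // 1970-01-17T13:27:18.439Z
        println(asSeconds(raw)) // 2015-05-05T15:07:19Z
      }
    }
    ```

    If the column holds seconds and the cast expects milliseconds (or the reverse), scaling by 1000 before the cast is the usual remedy; which unit a given Spark version expects is worth checking against its documentation.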
    
  • 2021-01-03 05:39

    I solved the issue by adding this function:

    def convert(time: Long): String = {
      // time is epoch milliseconds; the date is rendered in the JVM's default time zone
      val sdf = new java.text.SimpleDateFormat("yyyy-MM-dd")
      sdf.format(new java.util.Date(time))
    }
    

    Then I registered it with the sqlContext like this (on Spark 1.3 and later, the equivalent call is sqlContext.udf.register):

    sqlContext.registerFunction("convert", convert _)
    

    Then I could finally group by date:

    select convert(time) as day, count(*) as total from table group by convert(time)
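
    To show what grouping by the converted value does, here is a minimal sketch on a plain Scala collection, without Spark. It assumes the timestamps are epoch milliseconds (as new java.util.Date(time) implies) and fixes the zone to UTC, which is a choice made for this example; the original function uses the JVM default zone. The object name, countPerDay, and the sample values are hypothetical.

    ```scala
    import java.time.{Instant, ZoneOffset}
    import java.time.format.DateTimeFormatter

    // Hypothetical demo object, not part of the original answer.
    object GroupByDay {
      // DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat.
      private val dayFormat =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

      // Epoch milliseconds in, "yyyy-MM-dd" (UTC) out.
      def convert(timeMillis: Long): String =
        dayFormat.format(Instant.ofEpochMilli(timeMillis))

      // Count how many timestamps fall on each calendar day.
      def countPerDay(times: Seq[Long]): Map[String, Int] =
        times.groupBy(convert).map { case (day, rows) => day -> rows.size }

      def main(args: Array[String]): Unit = {
        // Sample values: two instants on 2019-12-30 and one on 2020-01-01 (UTC).
        val times = Seq(1577664000000L, 1577707200000L, 1577836800000L)
        countPerDay(times).toSeq.sortBy(_._1).foreach(println)
        // (2019-12-30,2)
        // (2020-01-01,1)
      }
    }
    ```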
    