I have an RDD containing a timestamp field named time of type long:
root
|-- id: string (nullable = true)
|-- value1: string (nullable = true)
|-- time: long (nullable = true)
I'm using Spark 1.4.0, and since 1.2.0 the DATE type has been available in the Spark SQL API (SPARK-2562). DATE should allow you to group by the time as YYYY-MM-DD.
I also have a similar data structure, where my created_on field is analogous to your time field.
root
|-- id: long (nullable = true)
|-- value1: long (nullable = true)
|-- created_on: long (nullable = true)
I solved it using FROM_UNIXTIME(created_on,'yyyy-MM-dd') and it works well (note the lowercase yyyy: the pattern follows Java's SimpleDateFormat rules, where uppercase YYYY means week-based year and can produce the wrong year for dates near January 1st):
val countQuery = "SELECT FROM_UNIXTIME(created_on,'yyyy-MM-dd') AS `date_created`, COUNT(*) AS `count` FROM user GROUP BY FROM_UNIXTIME(created_on,'yyyy-MM-dd')"
From here on you can perform the usual operations: execute the query into a DataFrame, and so on.
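For completeness, here is a minimal sketch of the surrounding plumbing in Spark 1.4. Since FROM_UNIXTIME is a Hive UDF, a HiveContext is used; the User case class, the sample rows, and the sc SparkContext name are assumptions matching the schema and query above:

```scala
import org.apache.spark.sql.hive.HiveContext

// assumes an existing SparkContext named `sc`
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

// assumed case class mirroring the schema shown above
case class User(id: Long, value1: Long, created_on: Long)

// made-up sample rows: two epoch-second timestamps on different days
val df = sc.parallelize(Seq(
  User(1L, 10L, 1419811200L), // 2014-12-29 UTC
  User(2L, 20L, 1419897600L)  // 2014-12-30 UTC
)).toDF()

df.registerTempTable("user")

val countQuery = "SELECT FROM_UNIXTIME(created_on,'yyyy-MM-dd') AS `date_created`, COUNT(*) AS `count` FROM user GROUP BY FROM_UNIXTIME(created_on,'yyyy-MM-dd')"
val counts = sqlContext.sql(countQuery)
counts.show()
```

The same grouping can also be expressed with the DataFrame API via df.selectExpr(...).groupBy("date_created").count(), avoiding the raw SQL string.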
FROM_UNIXTIME probably worked because I have Hive included in my Spark installation and it is a Hive UDF. However, it will become part of Spark SQL's native syntax in a future release (SPARK-8175).
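To illustrate the yyyy vs. YYYY pitfall mentioned above, here is a plain-Scala demo using java.text.SimpleDateFormat, whose pattern letters Hive's FROM_UNIXTIME follows (the epoch value is made up for the demo and deliberately chosen near a year boundary):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

// 2014-12-29T00:00:00Z -- a Monday whose week already belongs to week-year 2015
val epochSeconds = 1419811200L
val date = new Date(epochSeconds * 1000L)

def format(pattern: String): String = {
  val sdf = new SimpleDateFormat(pattern, Locale.US)
  sdf.setTimeZone(TimeZone.getTimeZone("UTC")) // pin the zone so the output is deterministic
  sdf.format(date)
}

val calendarYear = format("yyyy-MM-dd") // "2014-12-29" -- the calendar year, as intended
val weekYear     = format("YYYY-MM-dd") // "2015-12-29" -- week-based year: wrong for grouping by day
```

This is why the query above uses 'yyyy-MM-dd': with uppercase YYYY, rows from the last days of December can silently land in the following year's groups.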