We have a Hive warehouse and wanted to use Spark for various tasks (mainly classification), at times writing the results back as a Hive table. For example, we wrote the following
Perhaps this was not possible when the question was written, but doesn't it make sense now (post 1.3) to use the createDataFrame() call?
After getting your first RDD, it looks like you could call createDataFrame() on it, then run a simple SQL statement against the resulting structure to get the whole job (sum and grouping) done in one pass. Plus, the DataFrame can infer its schema directly upon creation, if I'm reading the API doc correctly.
(http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.HiveContext)
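A minimal sketch of that idea, written in Scala rather than the Python API linked above. The Client case class, the temp table name "client", the app name, and the local master are all assumptions made for illustration; the sample rows stand in for "your first RDD".

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; its field names become the column names.
case class Client(ID: Int, Categ: String, Amnt: Int)

// In spark-shell, reuse the provided `sc` instead of creating a new context.
val conf = new SparkConf().setAppName("createDataFrame-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Stand-in for the first RDD mentioned above.
val rdd = sc.parallelize(Seq(Client(1, "A", 10), Client(2, "A", 5), Client(3, "B", 56)))

// The schema is inferred from the case class by reflection.
val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("client")

// Sum and grouping in a single SQL statement.
val totals = sqlContext.sql(
  "SELECT Categ, SUM(Amnt) AS total, COUNT(ID) AS n FROM client GROUP BY Categ")
totals.show()
```

The aliases total and n are chosen here only to stay clear of SQL keywords; any valid column names would do.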
This error can be solved by setting hive.exec.scratchdir to a folder the user has access to.
Which version of Spark are you using?
This answer is based on Spark 1.6 and uses DataFrames.
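import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
// Example app name; in spark-shell, reuse the provided sc instead of creating one
val conf = new SparkConf().setAppName("GroupByExample")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Sample data with columns ID, Categ, Amnt
val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")
// Sum Amnt and count IDs per category
client.groupBy("Categ").agg(sum("Amnt").as("Sum"), count("ID").as("count")).show()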
+-----+---+-----+
|Categ|Sum|count|
+-----+---+-----+
|    A| 15|    2|
|    B| 56|    1|
+-----+---+-----+
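Since the question also asks about writing results back as a Hive table, here is a rough sketch of one way that might look on 1.6, swapping SQLContext for HiveContext and using the DataFrame writer. It assumes the spark-hive module is on the classpath; the table name "categ_summary", the column aliases, the app name, and the local master are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{count, sum}
import org.apache.spark.sql.hive.HiveContext

// In spark-shell, reuse the provided `sc` instead of creating a new context.
val conf = new SparkConf().setAppName("save-as-hive-table-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Same sample data and aggregation as in the answer above.
val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")
val summary = client.groupBy("Categ")
  .agg(sum("Amnt").as("total_amnt"), count("ID").as("id_count"))

// "categ_summary" is a hypothetical table name; Overwrite replaces any existing table.
summary.write.mode(SaveMode.Overwrite).saveAsTable("categ_summary")
```

How readable the resulting table is from Hive itself can depend on the storage format and versions involved, so treat this as a starting point rather than a verified recipe.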
Hope this helps!