Reading from and writing to Hive tables with Spark after aggregation

前端未结


We have a Hive warehouse and want to use Spark for various tasks (mainly classification), at times writing the results back as a Hive table. For example, we wrote the following


3 Answers

后悔当初

2021-02-09 06:40

Perhaps this was not possible when the question was written, but doesn't it make sense now (Spark 1.3 and later) to use the createDataFrame() call?

After getting your first RDD, it looks like you could call createDataFrame() on it and then run a simple SQL statement (a SUM with a GROUP BY) against the resulting DataFrame to get the whole job done in one pass. In addition, if I'm reading the API doc correctly, the DataFrame can infer its schema directly upon creation.

(http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.HiveContext)
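The idea above can be sketched roughly as follows. This is a minimal sketch rather than the asker's actual job: it is written in Scala to match the other code on this page, it assumes Spark 1.6 with Hive support on the classpath, and the `Sale` case class, the sample rows, and the temp table name `sales` are all hypothetical, since the original RDD is not shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical record type; createDataFrame infers the schema from its fields.
case class Sale(id: Int, categ: String, amnt: Int)

// Assumed local configuration. In spark-shell, reuse the existing `sc` instead.
val conf = new SparkConf().setAppName("createDataFrame-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Stand-in for the "first RDD" mentioned in the answer.
val rdd = sc.parallelize(Seq(Sale(1, "A", 10), Sale(2, "A", 5), Sale(3, "B", 56)))

// Build a DataFrame from the RDD and expose it to SQL under a temporary name.
val df = hiveContext.createDataFrame(rdd)
df.registerTempTable("sales")

// Sum and count per category in a single SQL statement.
val totals = hiveContext.sql(
  "SELECT categ, SUM(amnt) AS total, COUNT(id) AS cnt FROM sales GROUP BY categ")
totals.show()
```

Note that `registerTempTable` is the Spark 1.x name; later versions replaced it with `createOrReplaceTempView`.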

野的像风

2021-02-09 06:50

This error can be solved by setting hive.exec.scratchdir to a folder that the user has access to.


孤城傲影

2021-02-09 06:53

Which version of Spark are you using?

This answer is based on Spark 1.6 and uses DataFrames.

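import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

// Assumed local configuration; adjust the app name and master for your cluster.
val conf = new SparkConf().setAppName("aggregation-example").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._
val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

// Sum the amounts and count the rows in each category.
client.groupBy("Categ").agg(sum("Amnt").as("Sum"), count("ID").as("count")).show()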


+-----+---+-----+
|Categ|Sum|count|
+-----+---+-----+
|    A| 15|    2|
|    B| 56|    1|
+-----+---+-----+
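To write an aggregate like this back to Hive, which is what the question asks about, something along the following lines may work. It is a sketch under assumptions: Spark 1.6 built with Hive support, a `HiveContext` in place of the plain `SQLContext`, and a hypothetical table name `categ_summary`. The output columns are renamed to `total` and `cnt` here purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{count, sum}
import org.apache.spark.sql.hive.HiveContext

// Assumed local configuration. In spark-shell, reuse the existing `sc` instead.
val conf = new SparkConf().setAppName("hive-write-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// In Spark 1.x, HiveContext (not plain SQLContext) is what talks to the Hive metastore.
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")
val summary = client.groupBy("Categ").agg(sum("Amnt").as("total"), count("ID").as("cnt"))

// Hypothetical table name; Overwrite replaces the table if it already exists.
summary.write.mode(SaveMode.Overwrite).saveAsTable("categ_summary")

// Read the table back through the same context to check the result.
val readBack = hiveContext.table("categ_summary")
readBack.show()
```

By default `saveAsTable` stores the data in Spark's default data source format, so depending on the Spark version and format the table may not be readable from Hive itself. If Hive must read it, check the result from the Hive side as well.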

Hope this helps!
