Reading and writing from Hive tables with Spark after aggregation

故里飘歌 2021-02-09 06:29

We have a Hive warehouse and want to use Spark for various tasks (mainly classification), at times writing the results back as a Hive table. For example, we wrote the following

3 Answers
  • 2021-02-09 06:40

    Perhaps this was not possible when the question was written, but doesn't it make sense now (post-1.3) to use the createDataFrame() call?

    After getting your first RDD, it looks like you could make that call, then run a simple SQL statement against the resulting structure to get the whole job done in one pass (sum and grouping). Plus, the DataFrame can infer the schema directly upon creation, if I'm reading the API doc correctly. A sketch of this approach follows the API link below.

    (http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.HiveContext)
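
    A minimal sketch of that approach (Spark 1.3+ Scala API, assuming a HiveContext; the sample data and column names are made up to mirror the other answer):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-agg-sketch"))
    val hiveContext = new HiveContext(sc)

    // createDataFrame infers the schema from the tuple types in the RDD
    val rdd = sc.parallelize(Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)))
    val df = hiveContext.createDataFrame(rdd).toDF("ID", "Categ", "Amnt")
    df.registerTempTable("client")

    // One SQL statement does the sum and grouping in a single pass
    hiveContext.sql(
      "SELECT Categ, SUM(Amnt) AS Sum, COUNT(ID) AS count FROM client GROUP BY Categ"
    ).show()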

  • 2021-02-09 06:50

    This error can be solved by setting hive.exec.scratchdir to a folder the user has write access to; a short sketch of setting it from Spark follows.
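
    One way to set that property from Spark (a sketch: the path /tmp/hive-scratch is hypothetical and should point to a directory the user can write to; sc is assumed to be an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Point Hive's scratch directory at a writable location
    hiveContext.setConf("hive.exec.scratchdir", "/tmp/hive-scratch")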

  • 2021-02-09 06:53

    What version of Spark are you using?

    This answer is based on Spark 1.6 and uses DataFrames.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    val conf = new SparkConf().setAppName("client-aggregation") // app name is arbitrary
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Sample data: (ID, Categ, Amnt)
    val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")

    // Sum the amounts and count the IDs per category
    client.groupBy("Categ").agg(sum("Amnt").as("Sum"), count("ID").as("count")).show()
    
    
    +-----+---+-----+
    |Categ|Sum|count|
    +-----+---+-----+
    |    A| 15|    2|
    |    B| 56|    1|
    +-----+---+-----+
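
    The question also asks about writing the aggregated result back as a Hive table. In 1.6 that needs a HiveContext rather than the plain SQLContext; a minimal sketch (the target table name client_agg is hypothetical):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.hive.HiveContext

    // Build the DataFrame through a HiveContext so saveAsTable creates a real Hive table
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val result = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56)).toDF("ID", "Categ", "Amnt")
      .groupBy("Categ").agg(sum("Amnt").as("Sum"), count("ID").as("count"))

    // Write the result back to the warehouse as a managed table
    result.write.mode(SaveMode.Overwrite).saveAsTable("client_agg")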
    

    Hope this helps!
