Storing a DataFrame to a Hive partitioned table in Spark

拥有回忆 submitted on 2019-12-24 01:23:40

Question


I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a HiveContext. My code looks like this:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
newdf.registerTempTable("temp") // newdf is my dataframe
newdf.write.mode(SaveMode.Append).format("osv").partitionBy("date").saveAsTable("mytablename")

But when I deploy the app on the cluster, it says:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-3f00838b-c5d9-4a9a-9818-11fbb0007076/scratch_hive_2016-10-18_23-18-33_118_769650074381029645-1, expected: hdfs://

When I try to save it as a normal table and comment out the Hive configurations, it works. But with the partitioned table, it gives me this error.

I also tried registering the DataFrame as a temp table and then writing that temp table into the partitioned table. That gave me the same error.

Can someone please tell me how I can solve this? Thanks.


Answer 1:


You need to have Hadoop (HDFS) configured if you are deploying the app on the cluster.

With saveAsTable, the default location that Spark saves to is controlled by the Hive metastore (based on the docs). Another option is to use saveAsParquetFile with an explicit path and later register that path with your Hive metastore, or to use the newer DataFrameWriter interface and specify the path option: write.format(source).mode(mode).options(options).saveAsTable(tableName).
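A minimal sketch of that second option, assuming a hypothetical HDFS path and reusing the table name from the question (the path below is a placeholder, not from the original post):

// Sketch only: write the partitioned data to an explicit HDFS location and
// register it as a table, so nothing is staged on the local file system.
// "hdfs:///user/hive/warehouse/mytablename" is a hypothetical placeholder.
newdf.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", "hdfs:///user/hive/warehouse/mytablename")
  .partitionBy("date")
  .saveAsTable("mytablename")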




Answer 2:


I figured it out. In the code for the Spark app, I set the scratch directory location as below and it worked.

sqlContext.sql("SET hive.exec.scratchdir=<hdfs location>")
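For context, a minimal sketch of where that setting fits relative to the write; the HDFS scratch path here is a hypothetical placeholder, not from the original answer:

// Sketch only: point the Hive scratch directory at HDFS *before* writing,
// so staging files are created on HDFS instead of the local file system.
// "hdfs:///tmp/hive-scratch" is a hypothetical placeholder.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hiveContext.sql("SET hive.exec.scratchdir=hdfs:///tmp/hive-scratch")
newdf.write.mode(SaveMode.Append).partitionBy("date").saveAsTable("mytablename")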


Source: https://stackoverflow.com/questions/40122201/storing-a-dataframe-to-a-hive-partition-table-in-spark
