Storing a DataFrame to a Hive partitioned table in Spark

拥有回忆 submitted on 2019-12-24 01:23:40

Question


I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a HiveContext. My code looks like this:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
newdf.registerTempTable("temp") // newdf is my dataframe
newdf.write.mode(SaveMode.Append).format("osv").partitionBy("date").saveAsTable("mytablename")

But when I deploy the app on the cluster, it says:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-3f00838b-c5d9-4a9a-9818-11fbb0007076/scratch_hive_2016-10-18_23-18-33_118_769650074381029645-1, expected: hdfs://

When I try to save it as a normal table and comment out the Hive configurations, it works. But with the partitioned table, it gives me this error.

I also tried registering the DataFrame as a temp table and then writing that temp table into the partitioned table. That gave me the same error.

Can someone please tell me how I can solve this? Thanks.


Answer 1:


You need to have Hadoop (HDFS) configured if you are deploying the app on the cluster.

With saveAsTable, the default location that Spark saves to is controlled by the Hive metastore (based on the docs). Another option is to use saveAsParquetFile with an explicit path and later register that path with your Hive metastore, or to use the newer DataFrameWriter interface and specify the path option: write.format(source).mode(mode).options(options).saveAsTable(tableName).
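A minimal sketch of that second option, assuming a hypothetical HDFS path and reusing the table name from the question (the path below is a placeholder, not from the original post):

// Sketch only: write the partitioned data to an explicit HDFS location and
// register it as a table, so nothing is staged on the local file system.
// "hdfs:///user/hive/warehouse/mytablename" is a hypothetical placeholder.
newdf.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", "hdfs:///user/hive/warehouse/mytablename")
  .partitionBy("date")
  .saveAsTable("mytablename")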




Answer 2:


I figured it out. In the code for the Spark app, I set the scratch directory location as below and it worked.

sqlContext.sql("SET hive.exec.scratchdir=<hdfs location>")
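For context, a minimal sketch of where that setting fits relative to the write; the HDFS scratch path here is a hypothetical placeholder, not from the original answer:

// Sketch only: point the Hive scratch directory at HDFS *before* writing,
// so staging files are created on HDFS instead of the local file system.
// "hdfs:///tmp/hive-scratch" is a hypothetical placeholder.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hiveContext.sql("SET hive.exec.scratchdir=hdfs:///tmp/hive-scratch")
newdf.write.mode(SaveMode.Append).partitionBy("date").saveAsTable("mytablename")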


Source: https://stackoverflow.com/questions/40122201/storing-a-dataframe-to-a-hive-partition-table-in-spark
