Question
I can't save a table to the Hive database anymore using the metastore. I can see the tables in Spark using spark.sql,
but I can't see the same tables in the Hive database. I tried this, but it doesn't store the table to Hive. How can I configure the Hive metastore? The Spark version is 2.3.1.
If you want more details, please comment.
%spark
import org.apache.spark.sql.SparkSession

val spark = (SparkSession
  .builder
  .appName("interfacing spark sql to hive metastore without configuration file")
  .config("hive.metastore.uris", "thrift://xxxxxx.xxx:9083") // replace with your hive metastore service's thrift URL
  .enableHiveSupport() // don't forget to enable hive support
  .getOrCreate())

spark.conf.get("spark.sql.warehouse.dir") // Output: res2: String = /apps/spark/warehouse
spark.conf.get("hive.metastore.warehouse.dir") // NoSuchElementException
spark.conf.get("spark.hadoop.hive.metastore.uris") // NoSuchElementException

val df = (spark
  .read
  .format("parquet")
  .load(dataPath)) // dataPath: path to the parquet input, defined elsewhere

df.createOrReplaceTempView("my_temp_table")
spark.sql("drop table if exists my_table")
spark.sql("create table my_table using hive as select * from my_temp_table")
spark.sql("show tables").show(false) // I see my_table in the default database
Update after @catpaws's answer: in HDP 3.0 and later, Hive and Spark use independent catalogs.
Save a table to the Spark catalog:
df.createOrReplaceTempView("my_temp_table")
spark.sql("create table my_table as select * from my_temp_table")
vs.
Save a table to the Hive catalog:
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

hive.createTable("newTable")
  .ifNotExists()
  .column("ws_sold_time_sk", "bigint")
  ... // x 200 columns
  .column("ws_ship_date_sk", "bigint")
  .create()

df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "newTable")
  .save()
As you can see, this makes the Hive Warehouse Connector very impractical for DataFrames with hundreds of columns. Is there any way to save large DataFrames to Hive?
Answer 1:
From the Hortonworks docs: In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms. A table created by Spark resides in the Spark catalog. A table created by Hive resides in the Hive catalog. Databases fall under the catalog namespace, similar to how tables belong to a database namespace. Although independent, these tables interoperate, and you can see Spark tables in the Hive catalog, but only when using the Hive Warehouse Connector.
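The read side works the same way: a table in the Hive catalog is not visible to plain spark.sql, but it can be queried through the connector. A minimal sketch, reusing the builder from the question (database and table names are placeholders):

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

// This query runs against the Hive catalog (via LLAP), not the Spark catalog
hive.setDatabase("default")
hive.executeQuery("select * from my_table limit 10").show()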
Use the write operations of the HWC API to write a DataFrame to Hive.
Update: You can now (as of HDP 3.1) create a DataFrame, and if a Hive table that represents the DataFrame doesn't exist, the Hive Warehouse Connector creates it, as shown in the HDP 3.1 docs:
val df = ... // create a DataFrame from any source
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_Table")
  .save()
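For reference, the HIVE_WAREHOUSE_CONNECTOR constant used above ships with the connector; the HDP docs bring it into scope roughly like this (the package path may vary between HWC versions):

import com.hortonworks.hwc.HiveWarehouseSession._
// HIVE_WAREHOUSE_CONNECTOR names the connector's data source,
// "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"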
Answer 2:
As @catpaws said, Spark and Hive use independent catalogs. To save a DataFrame with many columns via the Hive Warehouse Connector, you can use my function:
import org.apache.spark.sql.DataFrame

// assumes `hive` is the HiveWarehouseSession built earlier:
// val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

def save_table_hwc(df: DataFrame, database: String, tableName: String): Unit = {
  hive.setDatabase(database)
  hive.dropTable(tableName, true, false) // args: table, ifExists, purge
  var table_builder = hive.createTable(tableName)
  for (field <- df.schema.fields) {
    // keep only letters and digits in column names (note: this also strips underscores)
    val name = field.name.replaceAll("[^\\p{L}\\p{Nd}]+", "")
    val data_type = field.dataType.sql
    table_builder = table_builder.column(name, data_type)
  }
  table_builder.create()
  df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", tableName).save()
}

save_table_hwc(df1, "default", "table_test1")
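One caveat with the replaceAll above: it removes every character that is not a letter or digit, so two distinct source columns can collapse to the same cleaned name. A quick plain-Scala illustration:

val clean = (s: String) => s.replaceAll("[^\\p{L}\\p{Nd}]+", "")
clean("ws_sold_time_sk") // "wssoldtimesk"
clean("ws sold-time.sk") // "wssoldtimesk" -- collides with the previous column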
Source: https://stackoverflow.com/questions/53323964/cant-save-table-to-hive-metastore-hdp-3-0