Question
I'm having issues with the schema for Hive tables being out of sync between Spark and Hive on a MapR cluster with Spark 2.1.0 and Hive 2.1.1.
I need to try to resolve this problem specifically for managed tables, but the issue can be reproduced with unmanaged/external tables.
Overview of Steps
- Use saveAsTable to save a dataframe to a given table.
- Use mode("overwrite").parquet("path/to/table") to overwrite the data for the previously saved table. I am actually modifying the data through a process external to Spark and Hive, but this reproduces the same issue.
- Use spark.catalog.refreshTable(...) to refresh the metadata.
- Query the table with spark.table(...).show(). Any columns that were the same between the original dataframe and the overwriting one will show the new data correctly, but any columns that exist only in the new data will not be displayed.
Example
db_name = "test_39d3ec9"
table_name = "overwrite_existing"
table_location = "<spark.sql.warehouse.dir>/{}.db/{}".format(db_name, table_name)
qualified_table = "{}.{}".format(db_name, table_name)
spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(db_name))
Save as a managed table
existing_df = spark.createDataFrame([(1, 2)])
existing_df.write.mode("overwrite").saveAsTable(table_name)
Note that saving as an unmanaged table with the following will produce the same issue:
existing_df.write.mode("overwrite") \
.option("path", table_location) \
.saveAsTable(qualified_table)
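As a side check (a sketch, not part of the repro steps above), DESCRIBE EXTENDED reports whether a table ended up managed or external:
spark.sql("DESCRIBE EXTENDED {}".format(qualified_table)).show(100, truncate=False)
# look for the table Type (MANAGED vs EXTERNAL) in the detailed table information row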
View the contents of the table
spark.table(table_name).show()
+---+---+
| _1| _2|
+---+---+
| 1| 2|
+---+---+
Overwrite the parquet files directly
new_df = spark.createDataFrame([(3, 4, 5, 6)], ["_4", "_3", "_2", "_1"])
new_df.write.mode("overwrite").parquet(table_location)
View the contents with the Parquet reader; the contents show correctly:
spark.read.parquet(table_location).show()
+---+---+---+---+
| _4| _3| _2| _1|
+---+---+---+---+
| 3| 4| 5| 6|
+---+---+---+---+
Refresh Spark's metadata for the table and read it in again as a table. The data is updated for the columns that were the same, but the additional columns do not display:
spark.catalog.refreshTable(qualified_table)
spark.table(qualified_table).show()
+---+---+
| _1| _2|
+---+---+
| 6| 5|
+---+---+
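To see the mismatch directly, the schema Spark serves for the table can be compared with the schema of the Parquet files on disk (a small sketch reusing the variables from the example; the commented results follow from the show() outputs above):
spark.table(qualified_table).printSchema()        # still the old schema: _1, _2
spark.read.parquet(table_location).printSchema()  # the new schema: _4, _3, _2, _1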
I have also tried updating the schema in Hive before calling spark.catalog.refreshTable, using the command below in the Hive shell:
ALTER TABLE test_39d3ec9.overwrite_existing REPLACE COLUMNS (`_1` bigint, `_2` bigint, `_3` bigint, `_4` bigint);
After running the ALTER command, DESCRIBE shows the schema correctly in Hive:
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
_3 bigint
_4 bigint
Before running the ALTER command, it shows only the original columns, as expected:
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
I then ran spark.catalog.refreshTable again, but it didn't affect Spark's view of the data.
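One detail that may be relevant here (an assumption about Spark internals, not something verified above): tables created through saveAsTable store a serialized copy of the schema in table properties such as spark.sql.sources.schema.part.0, and ALTER TABLE ... REPLACE COLUMNS would not rewrite those properties. They can be inspected like this:
spark.sql("SHOW TBLPROPERTIES test_39d3ec9.overwrite_existing").show(50, truncate=False)
# look for spark.sql.sources.schema.numParts and spark.sql.sources.schema.part.0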
Additional Notes
On the Spark side, I did most of my testing with PySpark, but I also tested in spark-shell (Scala) and a spark-sql shell. While in the Spark shell I also tried using a HiveContext, but that didn't work either:
import org.apache.spark.sql.hive.HiveContext
import spark.sqlContext.implicits._
val hiveObj = new HiveContext(sc)
hiveObj.refreshTable("test_39d3ec9.overwrite_existing")
After performing the ALTER command in the hive shell, I verified in Hue that the schema also changed there.
I also tried running the ALTER command with spark.sql("ALTER ..."), but the version of Spark we are on (2.1.0) does not allow it; it looks like it won't be available until Spark 2.2.0, based on this issue: https://issues.apache.org/jira/browse/SPARK-19261
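For reference, the syntax added by SPARK-19261 (Spark 2.2.0+ only; a sketch of what should become possible, not something that runs on 2.1.0) would be:
spark.sql("ALTER TABLE test_39d3ec9.overwrite_existing ADD COLUMNS (`_3` bigint, `_4` bigint)")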
I have also read through the spark docs again, specifically this section: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion
Based on those docs, spark.catalog.refreshTable should work. The configuration for spark.sql.hive.convertMetastoreParquet is typically false, but I switched it to true for testing and it didn't seem to affect anything.
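For completeness, toggling that setting at the session level looks like this (a sketch; the flag controls whether Spark uses its built-in Parquet reader for metastore Parquet tables):
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.catalog.refreshTable(qualified_table)
spark.table(qualified_table).show()  # result was unchanged in the tests described above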
Any help would be appreciated, thank you!
Answer 1:
I faced a similar issue while using Spark 2.2.0 on CDH 5.11.x.
After df.write.mode("overwrite").saveAsTable(), when I issued spark.read.table(...).show(), no data was displayed.
On checking, I found this was a known issue with the CDH Spark 2.2.0 version. The workaround was to run the command below after the saveAsTable command was executed:
spark.sql("ALTER TABLE qualified_table set SERDEPROPERTIES ('path'='hdfs://{hdfs_host_name}/{table_path}')")
spark.catalog.refreshTable("qualified_table")
e.g., if your table LOCATION is hdfs://hdfsHA/user/warehouse/example.db/qualified_table, then assign 'path'='hdfs://hdfsHA/user/warehouse/example.db/qualified_table'.
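Adapted to the table from the question, that would look roughly like the following (the HDFS host and warehouse directory are placeholders to fill in for your cluster; a sketch, not verified on MapR):
spark.sql("ALTER TABLE test_39d3ec9.overwrite_existing SET SERDEPROPERTIES "
          "('path'='hdfs://<hdfs_host>/<warehouse_dir>/test_39d3ec9.db/overwrite_existing')")
spark.catalog.refreshTable("test_39d3ec9.overwrite_existing")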
This worked for me; give it a try. I assume your issue has been resolved by now, but if not, you can try this method.
Workaround source: https://www.cloudera.com/documentation/spark2/2-2-x/topics/spark2_known_issues.html
Source: https://stackoverflow.com/questions/49201436/spark-and-hive-table-schema-out-of-sync-after-external-overwrite