save Spark dataframe to Hive: table not readable because “parquet not a SequenceFile”

走了就别回头了 2020-12-28 22:08

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.

The documentation states:

\"spark.sql.hive.convertMetasto

4 Answers
  • 2020-12-28 22:21

    I did this in PySpark, Spark version 2.3.0:

    First create an empty table where we need to save/overwrite the data, like:

    create table databaseName.NewTableName like databaseName.OldTableName;
    

    Then run the command below:

    df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName")
    

    The caveat is that you can't read this table with Hive, but you can read it with Spark.
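
    A quick way to confirm the Spark side works (a sketch, assuming a Spark 2.x SparkSession named spark and the table created above):

    # Reading through Spark succeeds because Spark understands the
    # datasource-table metadata it wrote to the metastore; Hive does not.
    spark.table("databaseName.NewTableName").show(5)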

  • 2020-12-28 22:26

    I hit this issue last week and was able to find a workaround.

    Here's the story: I can see the table in Hive if I create it without partitionBy:

    spark-shell>someDF.write.mode(SaveMode.Overwrite)
                      .format("parquet")
                      .saveAsTable("TBL_HIVE_IS_HAPPY")
    
    hive> desc TBL_HIVE_IS_HAPPY;
          OK
          user_id                   string                                      
          email                     string                                      
          ts                        string                                      
    

    But Hive can't understand the table schema (the schema is empty...) if I create it with partitionBy:

    spark-shell>someDF.write.mode(SaveMode.Overwrite)
                      .format("parquet")
                      .partitionBy("day")
                      .saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
    
    hive> desc TBL_HIVE_IS_NOT_HAPPY;
          # col_name                data_type               comment
          col                       array<string>           from deserializer
    

    [Solution]:

    spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
    spark-shell>df.write
                  .partitionBy("ts")
                  .mode(SaveMode.Overwrite)
                  .saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
    
    
    hive> DROP TABLE IF EXISTS Happy_HIVE;
    hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
                                           PARTITIONED BY(day STRING)
                                           STORED AS PARQUET
                                           LOCATION '/apps/hive/warehouse/Happy_HIVE';
    hive> MSCK REPAIR TABLE Happy_HIVE;
    

    The problem is that the datasource table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the docs, Spark only puts data onto HDFS but won't create the table in Hive. You can then manually go into the hive shell and create an external table with the proper schema & partition definition pointing to the data location. I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
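
    Since the original question asks for PySpark, a rough equivalent of the same workaround might look like this (a sketch only, using the Spark 1.4+ writer API and the table/column names from the example above; depending on your setup the DDL steps may need to run in the Hive shell instead, as shown above):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # Keep Spark from using its built-in Parquet support for metastore tables.
    sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

    # Write the partitioned Parquet data; the metastore entry this creates is
    # the Spark-internal one that Hive can't read.
    df.write.partitionBy("day").mode("overwrite").saveAsTable("Happy_HIVE")

    # Re-register the table in a Hive-compatible way, pointing at the same
    # warehouse location, then pick up the partitions (mirrors the hive-shell
    # steps above).
    sqlContext.sql("DROP TABLE IF EXISTS Happy_HIVE")
    sqlContext.sql("""
        CREATE EXTERNAL TABLE Happy_HIVE (user_id STRING, email STRING, ts STRING)
        PARTITIONED BY (day STRING)
        STORED AS PARQUET
        LOCATION '/apps/hive/warehouse/Happy_HIVE'
    """)
    sqlContext.sql("MSCK REPAIR TABLE Happy_HIVE")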

  • 2020-12-28 22:44

    I've been there...
    The API is kinda misleading on this one.
    DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
    It also stores something in the Hive metastore, but not what you intend.
    This remark was made on the spark-user mailing list regarding Spark 1.3.

    If you wish to create a Hive table from Spark, you can use this approach (sketched after the list):
    1. Use Create Table ... via SparkSQL for Hive metastore.
    2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3)
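
    A minimal PySpark sketch of those two steps (assuming Spark 1.3's HiveContext; the table name and columns here are hypothetical):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # Step 1: create the table through SparkSQL so the metastore gets a
    # Hive-compatible definition.
    sqlContext.sql(
        "CREATE TABLE IF NOT EXISTS my_hive_table "
        "(user_id STRING, email STRING, ts STRING) STORED AS PARQUET")

    # Step 2: load the actual data; the second argument is the overwrite flag
    # (Spark 1.3 signature: insertInto(tableName, overwrite)).
    someDF.insertInto("my_hive_table", True)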

  • 2020-12-28 22:44

    MSCK REPAIR TABLE (from the answer above) updates the Hive metastore with partition metadata that doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in the metastore to the Hive metastore.
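
    If you're driving everything from PySpark rather than the Hive shell, the same repair can be issued through the HiveContext (a sketch; on older Spark versions the command is delegated to Hive):

    # Register any partitions that exist on HDFS but are missing from the metastore.
    sqlContext.sql("MSCK REPAIR TABLE Happy_HIVE")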
