Spark SQL on ORC files doesn't return correct Schema (Column names)

甜味超标 2020-12-21 09:55

I have a directory containing ORC files. I am creating a DataFrame using the below code

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/
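A minimal way to see the symptom is to print the schema Spark infers from the files themselves (a sketch; the path below is a hypothetical stand-in for the elided directory). With ORC files written by older Hive versions the column names typically come back as generic placeholders such as _col0, _col1, and so on:

    // hypothetical path for illustration: substitute the real ORC directory
    val data = sqlContext.sql("SELECT * FROM orc.`/path/to/orc/files`")

    // inspect what Spark inferred from the files; with ORC written by Hive 1.2.x
    // this often shows placeholder names rather than the real column names
    data.printSchema()
    println(data.columns.mkString(", "))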


        
5 Answers
  • 2020-12-21 09:57

    If you have a Parquet version of the same table as well, you can just copy the column names over, which is what I did (also, the date column was the partition key for the ORC table, so I had to move it to the end):

    import functools

    tx = sqlContext.table("tx_parquet")
    df = sqlContext.table("tx_orc")
    tx_cols = tx.schema.names
    tx_cols.remove('started_at_date')
    tx_cols.append('started_at_date')  # move the partition column to the end
    # fix column names for orc by renaming each column positionally
    oldColumns = df.schema.names
    newColumns = tx_cols
    df = functools.reduce(
        lambda df, idx: df.withColumnRenamed(
            oldColumns[idx], newColumns[idx]),
        range(len(oldColumns)), df)
    
  • 2020-12-21 10:03

    We can use:

    val df = hiveContext.read.table("tableName")

    Then df.schema or df.columns will give the actual column names.
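    For example (a sketch, assuming an ORC-backed table named "tableName" is registered in the Hive metastore):

    val df = hiveContext.read.table("tableName")
    df.printSchema()             // names come from the metastore, not from the ORC files
    df.columns.foreach(println)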

  • 2020-12-21 10:13

    The problem is the bundled Hive version (1.2.1), which has this bug: HIVE-4243.

    It was fixed in Hive 2.0.0.

  • 2020-12-21 10:18

    Setting

    sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')
    

    fixes this.
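    The same flag can also be set from a Scala session or at job submission time (a sketch; adjust to your deployment):

    // Scala equivalent on an existing SQLContext / HiveContext
    sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")

    // or pass it on the command line:
    //   spark-submit --conf spark.sql.hive.convertMetastoreOrc=false ...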

  • 2020-12-21 10:20

    If a version upgrade is not an available option, a quick fix could be to rewrite the ORC files using Pig. That seems to work just fine.
