I have a directory containing ORC files. I am creating a DataFrame using the code below:

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/...`")

The resulting DataFrame, however, has column names like _col0, _col1, _col2 instead of the actual column names.
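For reference, the same load in PySpark, with a hypothetical path, reproduces the symptom:

# hypothetical path; substitute your real ORC directory
df = sqlContext.sql("SELECT * FROM orc.`/tmp/orc_data`")
print(df.columns)  # e.g. ['_col0', '_col1', '_col2'] instead of the real names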
If you also have a parquet version of the same data, you can just copy the column names over, which is what I did (the date column was the partition key for the ORC table, so it had to move to the end):
import functools

tx = sqlContext.table("tx_parquet")
df = sqlContext.table("tx_orc")

tx_cols = tx.schema.names
tx_cols.remove('started_at_date')
tx_cols.append('started_at_date')  # partition key: move it to the end

# fix the column names for the ORC DataFrame
oldColumns = df.schema.names
newColumns = tx_cols
df = functools.reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df)
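A shorter equivalent, assuming the column order already matches, is to rename everything in one call with toDF:

df = sqlContext.table("tx_orc").toDF(*tx_cols)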
We can use:

val df = hiveContext.read.table("tableName")

Then df.schema or df.columns will give the actual column names, since they come from the Hive metastore rather than from the ORC files themselves.
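The PySpark equivalent, assuming a hypothetical metastore table named tx_orc, is:

df = sqlContext.read.table("tx_orc")
print(df.columns)  # actual column names, taken from the metastore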
The problem is the Hive version: 1.2.1 has bug HIVE-4243, under which ORC files are written without the real column names (you get _col0, _col1, and so on). This was fixed in Hive 2.0.0. Setting

sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')

works around it, because Spark then reads the table through Hive's ORC SerDe and takes the column names from the metastore instead of the file footers.
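The same setting can also be passed when the job is launched instead of in code, e.g. with spark-submit (a sketch, assuming a hypothetical script name):

spark-submit --conf spark.sql.hive.convertMetastoreOrc=false my_job.py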
If upgrading is not an option, a quick fix is to rewrite the ORC files using Pig. That seems to work just fine.
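The rewrite can also be done from Spark itself rather than Pig; a minimal sketch, assuming hypothetical paths and column names:

# read the badly-named files, attach the real names, write a fresh copy
df = sqlContext.read.orc("/tmp/orc_in")          # hypothetical input path
df = df.toDF('id', 'amount', 'started_at_date')  # real names supplied by hand
df.write.orc("/tmp/orc_out")                     # hypothetical output path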