Dataframes reading json files with changing schema

Submitted by 半腔热情 on 2020-01-25 06:24:00

Question


I am currently reading JSON files whose schema varies from file to file. We use the following logic to read them: first read a base file that contains every field, then read the actual data. We take this approach because Spark infers the schema from the first file it reads, and no single data file contains all the fields. So we are just tricking the code into learning the full schema first and then reading the actual data.

rdd = sc.textFile("baseSchemaWithAllColumns.json").union(sc.textFile("pathToActualFile.json"))
df = sqlContext.read.json(rdd)

# Create a DataFrame, then save it as a temp table and query it

I know the above is just a workaround, and we need a cleaner solution for accepting JSON files with varying schemas.

I understand that there are two other ways to specify the schema, as mentioned here.

However, for those it looks like we would need to parse the JSON ourselves and map each field to the data received.

There also seems to be an option for Parquet schema merging, but that looks like it applies mostly when reading into a DataFrame, or am I missing something here?
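What schema merging does for Parquet is, at its core, a union of the fields seen across files. The idea can be sketched in plain Python (illustrative only, with hypothetical sample records; this is not the Spark implementation):

```python
import json

def merged_fields(json_lines):
    """Union of top-level keys across JSON records, in sorted order --
    conceptually what Parquet's mergeSchema option does for columns."""
    fields = set()
    for line in json_lines:
        fields.update(json.loads(line).keys())
    return sorted(fields)

# Two records with different schemas (hypothetical samples):
file_a = '{"id": 1, "name": "alice"}'
file_b = '{"id": 2, "email": "bob@example.com"}'
print(merged_fields([file_a, file_b]))  # → ['email', 'id', 'name']
```

Records missing a merged field would get null for that column, which matches how Spark fills in absent columns after merging.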

What is the best way to read JSON files with a changing schema and work with Spark SQL for querying?

Can I just read the JSON file as-is, save it as a temp table, and then use mergeSchema=true while querying?

Source: https://stackoverflow.com/questions/35995785/dataframes-reading-json-files-with-changing-schema
