Flattening nested JSON with Spark Scala creates two columns with the same name, causing a duplicate column error in Phoenix

Submitted by 久未见 on 2020-01-24 22:56:10


I am trying to flatten a deeply nested JSON document into a Spark DataFrame, with the ultimate goal of writing that DataFrame to Phoenix. I can flatten the JSON successfully with the following code:

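import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

def recurs(df: DataFrame): DataFrame = {
  // Stop once no struct or array-of-struct columns remain.
  if(df.schema.fields.find(_.dataType match {
    case ArrayType(StructType(_),_) | StructType(_) => true
    case _ => false
  }).isEmpty) df
  else {
    val columns = df.schema.fields.map(f => f.dataType match {
      case _: ArrayType => explode(col(f.name)).as(f.name)
      // "parent.*" keeps only the child names, so the parent prefix is lost.
      case s: StructType => col(s"${f.name}.*")
      case _ => col(f.name)
    })
    recurs(df.select(columns: _*))
  }
}

val df = spark.read.json(json_location)
val flatten_df = recurs(df)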

My nested JSON looks something like this:

{
  "Total Value": 3,
  "Topic": "Example",
  "values": [
    {
      "value": "#example1",
      "points": [["123", "156"]],
      "properties": {
        "date": "12-04-19",
        "value": "Model example 1"
      }
    },
    {
      "value": "#example2",
      "points": [["124", "157"]],
      "properties": {
        "date": "12-05-19",
        "value": "Model example 2"
      }
    }
  ]
}

The output I am getting:

|Total Value| Topic     |value     | points      | date       | value             |
| 3         | Example   | example1 | [123,156]   | 12-04-19   | Model example 1   |
| 3         | Example   | example2 | [124,157]   | 12-05-19   | Model example 2   |

Because the key value appears twice in the JSON, the flattened DataFrame ends up with two columns named value. Phoenix does not allow this and rejects the data on ingestion.

The error message is:

ERROR 514 (42892): A duplicate column name was detected in the object definition or ALTER TABLE/VIEW statement
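
The clash can be confirmed before writing to Phoenix by looking for repeated names in the DataFrame's column list. Below is a minimal sketch in plain Scala; duplicateNames is a hypothetical helper (not part of Spark or Phoenix), and the column list is copied by hand from the output above:

```scala
// Hypothetical helper: returns every name that occurs more than once,
// in order of first appearance.
def duplicateNames(columns: Seq[String]): Seq[String] =
  columns.distinct.filter(name => columns.count(_ == name) > 1)

// Column names as produced by the flattening in the question.
val flattened = Seq("Total Value", "Topic", "value", "points", "date", "value")
val dups = duplicateNames(flattened)
println(dups) // prints List(value)
```

With a real DataFrame this would presumably be called as duplicateNames(flatten_df.columns.toSeq).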

I am expecting the following output, so that Phoenix can tell the columns apart:

|Total Value| Topic     |values.value  | values.points | values.properties.date | values.properties.value|
| 3         | Example   | example1     | [123,156]     | 12-04-19               | Model example 1        |
| 3         | Example   | example2     | [124,157]     | 12-05-19               | Model example 2        |

With these column names Phoenix could ingest the data without problems. Please suggest changes to the flattening code, or any other way to achieve this. Thanks.


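You need a few small changes to the recurs method:

  1. Match ArrayType(st: StructType, _) instead of any ArrayType, so that only arrays of structs are exploded.
  2. Avoid *, and select every field by name in the second case (StructType).
  3. Use backticks in the right places when renaming the fields, so that a dotted name such as values.properties is read as one column name and each new name keeps its parent's name as a prefix.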

Here's some code:

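import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

def recurs(df: DataFrame): DataFrame = {
  // Stop once no struct or array-of-struct columns remain.
  if(!df.schema.fields.exists(_.dataType match {
    case ArrayType(StructType(_),_) | StructType(_) => true
    case _ => false
  })) df
  else {
    val columns = df.schema.fields.flatMap(f => f.dataType match {
      // Explode only arrays of structs; backticks keep dotted names literal.
      case ArrayType(_: StructType, _) => Seq(explode(col(s"`${f.name}`")).as(f.name))
      // Select each sub-field explicitly and prefix it with the parent name.
      case s: StructType =>
        s.fieldNames.map{sf => col(s"`${f.name}`.`$sf`").as(s"${f.name}.$sf")}.toSeq
      case _ => Seq(col(s"`${f.name}`"))
    })
    recurs(df.select(columns: _*))
  }
}

val newDF = recurs(df).cache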

And the new output:

|Topic  |Total Value|values.points|values.properties.date|values.properties.value|values.value|
|Example|3          |[[123, 156]] |12-04-19              |Model example 1        |#example1   |
|Example|3          |[[124, 157]] |12-05-19              |Model example 2        |#example2   |

 |-- Topic: string (nullable = true)
 |-- Total Value: long (nullable = true)
 |-- values.points: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- values.properties.date: string (nullable = true)
 |-- values.properties.value: string (nullable = true)
 |-- values.value: string (nullable = true)
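
The renaming rule behind this output can be illustrated without Spark. The sketch below is only a toy model: Node, Leaf, Struct and qualifiedNames are hypothetical names introduced for illustration, not Spark types. It shows how joining each child name to its parent's path yields distinct names even though the leaf name value occurs twice:

```scala
// Toy stand-in for a nested schema (not Spark's StructType).
sealed trait Node
case object Leaf extends Node
final case class Struct(fields: List[(String, Node)]) extends Node

// Build fully qualified leaf names by joining parent and child with a dot.
def qualifiedNames(node: Node, prefix: String = ""): List[String] = node match {
  case Leaf => List(prefix)
  case Struct(fields) =>
    fields.flatMap { case (name, child) =>
      val path = if (prefix.isEmpty) name else s"$prefix.$name"
      qualifiedNames(child, path)
    }
}

// Shape of the sample document after the "values" array has been exploded.
val sample = Struct(List(
  "Topic" -> Leaf,
  "Total Value" -> Leaf,
  "values" -> Struct(List(
    "points" -> Leaf,
    "properties" -> Struct(List("date" -> Leaf, "value" -> Leaf)),
    "value" -> Leaf
  ))
))

val names = qualifiedNames(sample)
names.foreach(println)
```

The printed names match the column headers above, and names.distinct.size == names.size holds, which is the property Phoenix needs.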

