Spark AnalysisException when “flattening” DataFrame in Spark SQL

慢半拍i 2021-02-10 08:55

I'm using the approach given here to flatten a DataFrame in Spark SQL. Here is my code:

package com.acme.etl.xml

im

1 Answer
  • 2021-02-10 08:58

    Your document contains a multi-valued array, so you can't flatten it completely in one pass: the elements of the array can't all be given the same column name. It's also usually a bad idea to use a dot in a column name, because Spark's parser reads a dot as a nested-field accessor, so such a name has to be escaped with backticks every time it is referenced.
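
    To illustrate the point about dots, here is a minimal sketch on a toy DataFrame rather than the XML from the question. The object name DottedColumnNames, the helper sample, and the column names a.b and c are made up for the example, and the local master is an assumption so that it can run without a cluster:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DottedColumnNames {
  // Hypothetical helper: a one-row DataFrame whose first column name literally contains a dot.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "x")).toDF("a.b", "c")
  }

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption for running the sketch on one machine.
    val spark = SparkSession.builder().appName("DottedColumnNames").master("local[*]").getOrCreate()
    val df = sample(spark)

    // Unescaped, Spark parses "a.b" as field b of a column named a, which does not exist here:
    // df.select(col("a.b"))  // throws AnalysisException

    // Backticks make Spark treat the whole string as a single column name.
    df.select(col("`a.b`")).show()

    // Renaming with an underscore removes the need for escaping afterwards.
    val renamed = df.withColumnRenamed("a.b", "a_b")
    renamed.select(col("a_b")).show()

    spark.stop()
  }
}
```

    Once the column is renamed, every later select, filter, or join can refer to it without backticks, which is the reason for preferring '_' as the separator.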

    The usual way to flatten such a dataset is to create a new row for each element of the array. You can use the explode function to do this, but you will need to apply your flatten operation repeatedly, because calls to explode can't be nested.
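
    As a rough sketch of why more than one pass is needed, the toy example below explodes an array of arrays one level per select. The object name ExplodeSketch, the helpers sample and explodeTwice, and the columns id, items and item are all made up for the illustration, and the local master is again an assumption:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeSketch {
  // Hypothetical helper: one input row whose "items" column is an array of arrays.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, Seq(Seq("a", "b"), Seq("c")))).toDF("id", "items")
  }

  // Explodes one array level per select, because explode(explode(...)) is rejected by the analyzer.
  def explodeTwice(df: DataFrame): DataFrame = {
    // First pass: one row per inner array (2 rows for the sample data).
    val once = df.select(col("id"), explode(col("items")).as("item"))
    // Second pass: one row per string (3 rows for the sample data).
    once.select(col("id"), explode(col("item")).as("item"))
  }

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption for running the sketch on one machine.
    val spark = SparkSession.builder().appName("ExplodeSketch").master("local[*]").getOrCreate()
    explodeTwice(sample(spark)).show()
    spark.stop()
  }
}
```

    Each select removes exactly one level of array nesting, so a general flatten has to keep going until the schema contains no more arrays or structs.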

    The following code works as expected, using '_' instead of '.' as column name separator:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}
    
    object RuntimeError {   
    
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()
        val rowTag = "idocData"
        val dataFrameReader = spark.read.option("rowTag", rowTag)
        val xmlUri = "bad_011_1.xml"
        val df = dataFrameReader.format("xml").load(xmlUri)
    
        val df2 = flatten(df)
    
      }
    
      def flatten(df: Dataset[Row], prefixSeparator: String = "_") : Dataset[Row] = {
        import org.apache.spark.sql.functions.{col,explode}
    
        def mustFlatten(sc: StructType): Boolean =
          sc.fields.exists(f => f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType])
    
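        // One pass over the schema: turns nested struct fields into prefixed top-level columns
        // and explodes at most one array, since Spark allows only one generator per select.
        def flattenAndExplodeOne(sc: StructType, parent: Column = null, prefix: String = null, cols: Array[(DataType,Column)] = Array[(DataType,Column)]()): Array[(DataType,Column)] = {
          val res = sc.fields.foldLeft(cols)( (columns, f) => {
            val my_col = if (parent == null) col(f.name) else parent.getItem(f.name)
            val flat_name = if (prefix == null) f.name else s"${prefix}${prefixSeparator}${f.name}"
            f.dataType match {
              // Struct: recurse into its fields, carrying the accumulated name prefix.
              case st: StructType => flattenAndExplodeOne(st, my_col, flat_name, columns)
    
              case dt: ArrayType => {
                if (columns.exists(_._1.isInstanceOf[ArrayType])) {
                  // An array was already exploded in this pass; keep this one for a later pass.
                  columns :+ ((dt,  my_col.as(flat_name)))
                } else {
                  columns :+ ((dt, explode(my_col).as(flat_name)))
                }
              }
              // Leaf column: only rename it to its flattened name.
              case dt => columns :+ ((dt, my_col.as(flat_name)))
            }
          })
          res
        }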
    
        var flatDf = df
        while (mustFlatten(flatDf.schema)) {
          val newColumns = flattenAndExplodeOne(flatDf.schema, null, null).map(_._2)
          flatDf = flatDf.select(newColumns:_*)
        }
    
        flatDf
      }
    }
    

    The resulting df2 has the following schema and data:

    df2.printSchema
    root
     |-- E2EDP01008GRP_E2EDPT1001GRP_E2EDPT2001_DATAHEADERCOLUMN_DOCNUM: long (nullable = true)
     |-- E2EDP01008GRP__xmlns: string (nullable = true)
    
    
    df2.show(true)
    +--------------------------------------------------------------+--------------------+
    |E2EDP01008GRP_E2EDPT1001GRP_E2EDPT2001_DATAHEADERCOLUMN_DOCNUM|E2EDP01008GRP__xmlns|
    +--------------------------------------------------------------+--------------------+
    |                                                     141036013|http://Microsoft....|
    |                                                     141036013|http://Microsoft....|
    +--------------------------------------------------------------+--------------------+
    