Adding part of the parent Schema column to child in nested json in spark data frame

轮回少年 2021-01-17 06:01

I have the XML below that I am trying to load into a Spark DataFrame.

   

           


        
1 answer
  •  囚心锁ツ
    2021-01-17 06:49

    If you want two DataFrames, one for the Source and one for the Auditors that also carries the organizationId and sourceId of the Source, you can use the following logic.

    Judging from the given data and your attempts, applying the explode function to the env:Body.env:ContentItem column gives you the parent DataFrame:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    // Read each env:ContentEnvelope element as one row (requires the spark-xml package)
    val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml")
      .option("rowTag", "env:ContentEnvelope")
      .load("s3://trfsmallfffile/XML")

    // One row per env:ContentItem, with the item's fields promoted to top-level columns
    val dfContentItem = dfContentEnvelope
      .withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem")))
      .select("column1.*")
    // Pull the two key attributes out of sr:Source and keep the full struct alongside them
    val ParentDF = dfContentItem.select(
      $"env:Data.sr:Source._organizationId".as("organizationId"),
      $"env:Data.sr:Source._sourceId".as("sourceId"),
      $"env:Data.sr:Source".as("Source"))
    

    which would give you:

    +--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |organizationId|sourceId|Source                                                                                                                                                                                                                                                 |
    +--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |4295906830    |344     |[4295906830,344,[WrappedArray([3541,3024068,UNQ,3010546,true,false,false], [9574,3030421,UWE,3010547,true,false,false])],20171030T00:00:00+00:00,false,false,1.0,20171111T17:00:00+00:00,300,false,10K,3011835,20171030T00:00:00+00:00,SS,1000716240,1]|
    +--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
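
    The explode followed by select("column1.*") step above can be tried without the XML file. What follows is only a minimal sketch on in-memory data: the case classes Src, Item and Envelope, the object ParentSketch and the helper parentOf are hypothetical stand-ins, not part of the original schema, and the second sample row is made up for illustration.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, explode}

    // Hypothetical stand-ins for the XML structure; the real schema has more fields.
    case class Src(organizationId: Long, sourceId: Long)
    case class Item(action: String, source: Src)
    case class Envelope(items: Seq[Item])

    object ParentSketch {
      def parentOf(envelopes: DataFrame): DataFrame =
        envelopes
          // explode emits one output row per element of the array column
          .withColumn("item", explode(col("items")))
          // "item.*" promotes the struct's fields to top-level columns
          .select("item.*")
          // keep the two keys next to the full source struct
          .select(
            col("source.organizationId").as("organizationId"),
            col("source.sourceId").as("sourceId"),
            col("source").as("Source"))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]").appName("parent-sketch").getOrCreate()
        import spark.implicits._

        // One envelope holding two items; the second item is invented sample data.
        val envelopes = Seq(Envelope(Seq(
          Item("Insert", Src(4295906830L, 344L)),
          Item("Insert", Src(4295906831L, 345L))))).toDF()

        parentOf(envelopes).show(false)
        spark.stop()
      }
    }
    ```

    Each element of the items array becomes its own row, so one envelope with two items produces two parent rows.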
    

    For the child DataFrame, explode sr:Auditor from the parent DataFrame above, carrying organizationId and sourceId along:

    val childDF = ParentDF.select(
      $"organizationId",
      $"sourceId",
      explode($"Source.sr:Auditors.sr:Auditor").as("Auditors"))
    

    which should give you:

    +--------------+--------+-------------------------------------------+
    |organizationId|sourceId|Auditors                                   |
    +--------------+--------+-------------------------------------------+
    |4295906830    |344     |[3541,3024068,UNQ,3010546,true,false,false]|
    |4295906830    |344     |[9574,3030421,UWE,3010547,true,false,false]|
    +--------------+--------+-------------------------------------------+
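
    If the auditor fields are wanted as separate columns rather than one struct column, the child can be flattened with a second select. This is a hedged sketch on in-memory data, not the original XML: Auditor, SourceRow, ChildSketch, childOf and the field names auditorId and auditorCode are all hypothetical, since the real attribute names are not shown in the output above.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, explode}

    // Hypothetical field names; the real sr:Auditor struct has seven fields.
    case class Auditor(auditorId: Long, auditorCode: String)
    case class SourceRow(organizationId: Long, sourceId: Long, auditors: Seq[Auditor])

    object ChildSketch {
      def childOf(parent: DataFrame): DataFrame =
        parent
          // the parent keys are repeated on every exploded auditor row
          .select(col("organizationId"), col("sourceId"),
            explode(col("auditors")).as("auditor"))
          // "auditor.*" turns the struct's fields into top-level columns
          .select("organizationId", "sourceId", "auditor.*")

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]").appName("child-sketch").getOrCreate()
        import spark.implicits._

        val parent = Seq(SourceRow(4295906830L, 344L,
          Seq(Auditor(3541L, "UNQ"), Auditor(9574L, "UWE")))).toDF()

        childOf(parent).show(false)
        spark.stop()
      }
    }
    ```

    Note that explode drops parent rows whose array is null or empty. If a Source without auditors should still appear in the child DataFrame, explode_outer (available in Spark 2.2 and later) keeps such rows with a null in the exploded column.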
    

    I hope the answer is helpful.
