Adding part of the parent Schema column to child in nested json in spark data frame

轮回少年 2021-01-17 06:01
I have the XML below, which I am trying to load into a Spark data frame.

      
      
        
1 answer

囚心锁ツ (original poster) 2021-01-17 06:49

If you want two dataframes, one for the Source and one for the Auditors that also carries the organizationId and sourceId of the Source dataframe, you can use the following logic.

Judging from the given data and your attempts, applying the explode function to the env:Body.env:ContentItem column and then selecting the Source fields would give you the parent dataframe:

import sqlContext.implicits._
import org.apache.spark.sql.functions._
// Read each env:ContentEnvelope element as one row (needs the spark-xml package).
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "env:ContentEnvelope")
  .load("s3://trfsmallfffile/XML")
// One row per env:ContentItem, with the item's fields promoted to top-level columns.
val dfContentItem = dfContentEnvelope
  .withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem")))
  .select("column1.*")
// Pull the two key attributes out of sr:Source and keep the whole struct next to them.
val ParentDF = dfContentItem.select(
  $"env:Data.sr:Source._organizationId".as("organizationId"),
  $"env:Data.sr:Source._sourceId".as("sourceId"),
  $"env:Data.sr:Source".as("Source"))


which would give you:

+--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|organizationId|sourceId|Source                                                                                                                                                                                                                                                 |
+--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|4295906830    |344     |[4295906830,344,[WrappedArray([3541,3024068,UNQ,3010546,true,false,false], [9574,3030421,UWE,3010547,true,false,false])],20171030T00:00:00+00:00,false,false,1.0,20171111T17:00:00+00:00,300,false,10K,3011835,20171030T00:00:00+00:00,SS,1000716240,1]|
+--------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


For the child dataframe, explode the sr:Auditor array from the parent dataframe above, keeping the two key columns alongside it:

val childDF=ParentDF.select($"organizationId", $"sourceId", explode($"Source.sr:Auditors.sr:Auditor").as("Auditors"))


which should give you:

+--------------+--------+-------------------------------------------+
|organizationId|sourceId|Auditors                                   |
+--------------+--------+-------------------------------------------+
|4295906830    |344     |[3541,3024068,UNQ,3010546,true,false,false]|
|4295906830    |344     |[9574,3030421,UWE,3010547,true,false,false]|
+--------------+--------+-------------------------------------------+
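To try the parent-to-child step without the XML file or S3 access, here is a minimal sketch that builds a small parent dataframe in memory and applies the same explode pattern. It assumes Spark 2.x or later (SparkSession) running in local mode. The case classes, their field names, the `ExplodeSketch` object and the `toChild` helper are all hypothetical stand-ins, not the real schema that spark-xml infers from the question's data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

// Hypothetical stand-ins for the nested structure; the field names are
// assumptions for illustration, not the schema produced by spark-xml.
case class Auditor(auditorId: Long, auditorCode: String)
case class Source(organizationId: Long, sourceId: Long, auditors: Seq[Auditor])

object ExplodeSketch {
  // Copy the parent keys onto every element of the nested array:
  // explode emits one output row per array element.
  def toChild(parent: DataFrame): DataFrame =
    parent.select(
      parent("organizationId"),
      parent("sourceId"),
      explode(parent("auditors")).as("Auditors"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-sketch")
      .getOrCreate()
    import spark.implicits._

    // One parent row holding two auditors.
    val parentDF = Seq(
      Source(4295906830L, 344L, Seq(Auditor(3541L, "UNQ"), Auditor(9574L, "UWE")))
    ).toDF()

    val childDF = toChild(parentDF)
    childDF.show(false) // one row per auditor, each carrying the parent keys

    // Optional: promote the struct's fields to top-level columns.
    childDF.select("organizationId", "sourceId", "Auditors.*").show(false)

    spark.stop()
  }
}
```

The last select is optional. It may be convenient when the child table should have plain columns rather than a single struct column.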


I hope the answer is helpful
    
             
                                                        
            
            
              
                