Spark dataframes convert nested JSON to separate columns


I have a stream of JSON records with the following structure that gets converted to a DataFrame:

{
  "a": 3936,
  "b": 123,
  "c": "34",
  "attributes": {
    "


                      


      
      
        
3 Answers

        
                         				            
            
           
            
                              
                
              
              
                
遥遥无期 (2021-01-27 06:25)
              
            
            
                                                                       
Use Python:

1. Load the data into a DataFrame with the pandas library.
2. Convert each value of the JSON column from 'str' to 'dict'.
3. Extract the value of each feature.
4. Save the result to a new file.

import json

import pandas as pd

data = pd.read_csv("data.csv")     # load the CSV file from disk
json_data = data["Desc"]           # the column holding the JSON strings
data = data.drop(columns="Desc")   # drop the raw JSON column
total, defective = [], []          # one list per extracted feature

for raw in json_data:
    # parse the string into a dict; json.loads is safer than eval,
    # but it requires valid JSON (double-quoted keys and strings)
    record = json.loads(raw)
    total.append(record["Total"])          # collect the 'Total' feature
    defective.append(record["Defective"])  # collect the 'Defective' feature

# finally, add the extracted features as new columns
data["Total"] = total
data["Defective"] = defective

data.to_csv("result.csv")          # save to result.csv and check it
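Parsing the column with json.loads rather than eval avoids executing arbitrary text and handles JSON literals such as true and null. As a minimal sketch using only the standard library, and assuming the Desc column holds JSON objects with Total and Defective keys as in this answer, the same expansion could look like the following. The helper name expand_json_column and the sample rows are hypothetical.

```python
import json


def expand_json_column(rows, column, fields):
    """Return copies of rows with `column` parsed as JSON and `fields` promoted to top-level keys."""
    expanded = []
    for row in rows:
        parsed = json.loads(row[column])  # JSON text -> dict
        # keep every original key except the raw JSON column
        new_row = {k: v for k, v in row.items() if k != column}
        for field in fields:
            new_row[field] = parsed.get(field)  # None when the key is missing
        expanded.append(new_row)
    return expanded


# Hypothetical sample rows standing in for the CSV contents
rows = [
    {"id": "1", "Desc": '{"Total": 10, "Defective": 2}'},
    {"id": "2", "Desc": '{"Total": 7, "Defective": null}'},
]
result = expand_json_column(rows, "Desc", ["Total", "Defective"])
print(result)
```

Using parsed.get(field) instead of parsed[field] is a design choice here: a record that lacks a key produces None rather than aborting the whole run.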


                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
爱一瞬间的悲伤 (2021-01-27 06:27)
              
            
            
                                                                       
Using dot notation such as attributes.d, you can reference nested fields and create new top-level columns from them in your DataFrame. Look at the withColumn() method in the Java API.
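
To make the dotted notation concrete, here is a plain-Python sketch (not Spark code) of what resolving a path like attributes.d means for a single record. The helper name get_path and the sample values for d, e and f are hypothetical, since the JSON in the question is truncated.

```python
def get_path(record, path):
    """Follow a dotted path such as "attributes.d" through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]  # raises KeyError if a level is missing
    return value


# Hypothetical record; only a, b, c and the "attributes" key appear in the question
record = {"a": 3936, "b": 123, "c": "34", "attributes": {"d": 1, "e": 2, "f": 3}}

# Analogous to adding a top-level column from a nested field
record_with_d = {**record, "d": get_path(record, "attributes.d")}
print(record_with_d["d"])
```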
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
执笔经年 (2021-01-27 06:30)
              
            
            
                                                                       

If you want columns named from a to f:

df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")

If you want the nested columns to keep the attributes. prefix in their names:

df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")

If names of your columns are supplied from an external source (e.g. configuration):

val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")

df.select(colNames.head, colNames.tail: _*).toDF(colNames:_*)
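
When the column names are not known up front, one way to produce such a list is to walk a sample record and collect the dotted path of every leaf. The sketch below does this in plain Python rather than Spark; leaf_paths is a hypothetical helper, and the sample values under attributes are assumptions.

```python
def leaf_paths(record, prefix=""):
    """Return the dotted paths of all non-dict leaves, in insertion order."""
    paths = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(leaf_paths(value, path))  # descend into nested objects
        else:
            paths.append(path)
    return paths


# Hypothetical sample record shaped like the one in the question
sample = {"a": 3936, "b": 123, "c": "34", "attributes": {"d": 1, "e": 2, "f": 3}}
col_names = leaf_paths(sample)
print(col_names)
```

The resulting list has the same shape as colNames above, so it could be written to the configuration that supplies the column names.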


                                                                        
                                                        
            
            
              
                