I am facing an issue of how to split a multi-value column, i.e. List[String], into separate rows.
The initial dataset has the following types: Dataset[(Integer, String, Double, scala.List[String])]
explode is often suggested, but it's from the untyped DataFrame API and, given you use a Dataset, I think the flatMap operator might be a better fit (see org.apache.spark.sql.Dataset).
flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
You could use it as follows:
val ds = Seq(
    (0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3")))
  .toDF("id", "text", "value", "properties")
  .as[(Integer, String, Double, scala.List[String])]

scala> ds.flatMap { t =>
         t._4.map { prp =>
           (t._1, t._2, t._3, prp)
         }
       }.show
+---+-----------------+---+----+
| _1| _2| _3| _4|
+---+-----------------+---+----+
| 0|Lorem ipsum dolor|1.0|prp1|
| 0|Lorem ipsum dolor|1.0|prp2|
| 0|Lorem ipsum dolor|1.0|prp3|
+---+-----------------+---+----+
// or just using a for-comprehension
for {
t <- ds
prp <- t._4
} yield (t._1, t._2, t._3, prp)
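Note that the tuple output above loses the original column names (_1, _2, ...). As a small sketch of one way around that, assuming you can define a result case class (Flattened is a hypothetical name, not from the question) and that spark.implicits._ is in scope, as it must be for the snippets above:

// Flattened is a hypothetical result type (an assumption), used to keep named columns.
case class Flattened(id: Int, text: String, value: Double, property: String)

// One output row per element of `properties`, with readable column names.
val flattened = ds.flatMap { case (id, text, value, properties) =>
  properties.map(prp => Flattened(id, text, value, prp))
}
flattened.show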
Here's one way to do it with the RDD API:
val myRDD = sc.parallelize(Array(
  (0, "text0", 1.0, List("prp1", "prp2", "prp3")),
  (1, "text1", 2.0, List("prp4", "prp5", "prp6")),
  (2, "text2", 3.0, List("prp7", "prp8", "prp9"))
)).map {
  // key on the scalar fields, keep the list as the value
  case (i, t, v, ps) => ((i, t, v), ps)
}.flatMapValues(x => x).map {
  // one output tuple per (key, property) pair
  case ((i, t, v), p) => (i, t, v, p)
}
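If you need the flattened result back as a DataFrame rather than an RDD of tuples, a minimal sketch, assuming a SparkSession named spark is in scope:

import spark.implicits._  // needed for rdd.toDF

val flatDF = myRDD.toDF("id", "text", "value", "property")
flatDF.show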
You can use explode from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.explode

df.withColumn("property", explode($"property"))
Example:
val df = Seq((1, List("a", "b"))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: int, B: array<string>]
df.withColumn("B", explode($"B")).show
+---+---+
| A| B|
+---+---+
| 1| a|
| 1| b|
+---+---+
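Two related generators are worth knowing, as a sketch assuming Spark 2.2+ for explode_outer: explode drops rows whose array is null or empty, while explode_outer keeps them, and posexplode also emits each element's position within the array.

import org.apache.spark.sql.functions.{explode_outer, posexplode}

// Keeps rows with a null/empty array, producing a null element:
df.withColumn("B", explode_outer($"B")).show

// Emits each element's position along with its value;
// default output columns are named `pos` and `col`:
df.select($"A", posexplode($"B")).show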