How to transform Scala nested map operation to Scala Spark operation?

Tags: backend · Unresolved · 2 answers · 1798 views

情话喂你 asked on 2021-01-22 11:18

The code below calculates the Euclidean distance between two lists in a dataset:

 val user1 = List("a", "1", "3", "2", "6", "9")  //> user1  : List[Stri
2 Answers
  • 2021-01-22 11:53

    First of all, I suggest moving from storing your user model in a list to a well-typed class. Also note that you don't need to compute the distance between a user and itself (a-a, b-b), and there is no reason to compute each distance twice ((a-b) and (b-a)).

      val user1 = List("a", "1", "3", "2", "6", "9")
      val user2 = List("b", "1", "2", "2", "5", "9")
    
      case class User(name: String, features: Vector[Double])
    
      object User {
        def fromList(list: List[String]): User = list match {
          case h :: tail => User(h, tail.map(_.toDouble).toVector)
          case Nil       => throw new IllegalArgumentException("empty user row")
        }
      }
    
      def euclDistance(userA: User, userB: User) = {
        println(s"comparing ${userA.name} and ${userB.name}")
        val subElements = (userA.features zip userB.features) map {
          case (a, b) => (a - b) * (a - b)
        }
        val summed = subElements.sum
        val sqRoot = Math.sqrt(summed)
    
        sqRoot
      }
    
      val all = List(User.fromList(user1), User.fromList(user2))
    
    
      // needs: import org.apache.spark.rdd.RDD
      val users: RDD[(User, User)] = sc.parallelize(all.combinations(2).toSeq.map {
        case l :: r :: Nil => (l, r)
      })
    
      users.foreach(t => euclDistance(t._1, t._2))
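    The expected result can be checked locally without Spark. A minimal self-contained sketch using the two sample rows from the question (the object name `DistanceCheck` is just for illustration):

    ```scala
    // Local check of the Euclidean distance for the sample rows above.
    object DistanceCheck extends App {
      val u1 = List("1", "3", "2", "6", "9").map(_.toDouble)
      val u2 = List("1", "2", "2", "5", "9").map(_.toDouble)

      val dist = math.sqrt(
        (u1 zip u2).map { case (a, b) => (a - b) * (a - b) }.sum
      )

      // the element-wise differences are (0, 1, 0, 1, 0), so dist = sqrt(2)
      println(dist)
      assert(math.abs(dist - math.sqrt(2.0)) < 1e-9)
    }
    ```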
    
  • 2021-01-22 11:55

    The actual solution will depend on the dimensions of the dataset. Assuming that the original dataset fits in memory and you want to parallelize the computation of the Euclidean distance, I'd proceed like this:

    Assume `users` is the list of user ids and `data` maps each id to that user's data to be processed.

    // sc is the Spark Context
    type UserId = String
    type UserData = Array[Double]
    
    val users: List[UserId]= ???
    val data: Map[UserId,UserData] = ???
    // combinations generates the unique pairs of users for which distance makes sense:
    // given that euclidDistance(a,b) == euclidDistance(b,a), only (a,b) is in this set
    def combinations[T](l: List[T]): List[(T,T)] = l match {
        case Nil => Nil
        case h::Nil => Nil
        case h::t => t.map(x=>(h,x)) ++ combinations(t)
    }
    
    // broadcasts the data to all workers
    val broadcastData = sc.broadcast(data)
    val usersRdd = sc.parallelize(combinations(users))
    val euclidDistance: (UserData, UserData) => Double = (x,y) => 
        math.sqrt((x zip y).map{case (a,b) => math.pow(a-b,2)}.sum)
    val userDistanceRdd = usersRdd.map { case (user1, user2) =>
        val data = broadcastData.value
        val distance = euclidDistance(data(user1), data(user2))
        ((user1, user2), distance)
    }


    If the user data is too large, instead of using a broadcast variable, you would load it from external storage.
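    In that case the broadcast can be replaced by two joins against an RDD of user data. A sketch under the same type aliases (`UserId`, `UserData`) and the `combinations` helper above; the loading step (`???`) is left open since the storage is unspecified:

    ```scala
    import org.apache.spark.rdd.RDD

    // Sketch: resolve both sides of each pair with joins instead of a broadcast.
    val dataRdd: RDD[(UserId, UserData)] = ???   // loaded from external storage
    val pairsRdd: RDD[(UserId, UserId)] = sc.parallelize(combinations(users))

    val distancesRdd: RDD[((UserId, UserId), Double)] =
      pairsRdd
        .join(dataRdd)                                   // (u1, (u2, d1))
        .map { case (u1, (u2, d1)) => (u2, (u1, d1)) }   // re-key by second user
        .join(dataRdd)                                   // (u2, ((u1, d1), d2))
        .map { case (u2, ((u1, d1), d2)) => ((u1, u2), euclidDistance(d1, d2)) }
    ```

    This keeps the per-user data distributed across the cluster at the cost of a shuffle per join.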
