The code below calculates the Euclidean distance between two Lists in a dataset:
val user1 = List("a", "1", "3", "2", "6", "9") //> user1 : List[Stri
First of all, I suggest you move from storing your user model in a list to a well-typed class. Also, I don't think you need to compute the distance between a user and itself, such as (a-a) and (b-b), and there is no reason to compute each distance twice, as in (a-b) and (b-a).
val user1 = List("a", "1", "3", "2", "6", "9")
val user2 = List("b", "1", "2", "2", "5", "9")
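case class User(name: String, features: Vector[Double])
object User {
  // Builds a User from a row whose head is the name and whose tail holds the numeric features.
  def fromList(list: List[String]): User = list match {
    case h :: tail => User(h, tail.map(_.toDouble).toVector)
    case Nil       => throw new IllegalArgumentException("empty user row")
  }
}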
// Euclidean distance between the feature vectors of two users.
def euclDistance(userA: User, userB: User): Double = {
  println(s"comparing ${userA.name} and ${userB.name}")
  val squaredDiffs = (userA.features zip userB.features) map {
    case (a, b) => (a - b) * (a - b)
  }
  math.sqrt(squaredDiffs.sum)
}
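import org.apache.spark.rdd.RDD
// sc is assumed to be an existing SparkContext, e.g. the one spark-shell provides.
val all = List(User.fromList(user1), User.fromList(user2))
// combinations(2) yields each unordered pair once: no (a-a) pairs and no (b-a) duplicates.
val users: RDD[(User, User)] = sc.parallelize(all.combinations(2).toSeq.map {
  case l :: r :: Nil => (l, r)
})
// Collect the distances on the driver; a bare foreach would discard the returned values.
users.map(t => euclDistance(t._1, t._2)).collect().foreach(println)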
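To check the pairing logic without a Spark cluster, the same idea can be sketched in plain Scala. This is only an illustration under some assumptions: the names `Profile`, `parse`, `distance` and `pairwise` are hypothetical and not part of the code above, a third row "c" is made up so that more than one pair appears, and rows that fail to parse are silently dropped.

```scala
import scala.util.Try

// Hypothetical stand-in for the User class above, so the sketch is self-contained.
final case class Profile(name: String, features: Vector[Double])

// Parse a raw row such as List("a", "1", "3"); None if the row is empty or non-numeric.
def parse(row: List[String]): Option[Profile] = row match {
  case name :: rest => Try(rest.map(_.toDouble).toVector).toOption.map(Profile(name, _))
  case Nil          => None
}

// Euclidean distance; zip would silently truncate, so unequal lengths are rejected.
def distance(a: Profile, b: Profile): Double = {
  require(a.features.length == b.features.length, "feature vectors must have equal length")
  math.sqrt(a.features.zip(b.features).map { case (x, y) => (x - y) * (x - y) }.sum)
}

// Each unordered pair exactly once: no self-pairs, no mirrored duplicates.
def pairwise(profiles: List[Profile]): List[((String, String), Double)] =
  profiles.combinations(2).collect {
    case List(l, r) => ((l.name, r.name), distance(l, r))
  }.toList

val rows = List(
  List("a", "1", "3", "2", "6", "9"),
  List("b", "1", "2", "2", "5", "9"),
  List("c", "1", "3", "2", "6", "9") // made-up row, same features as "a"
)

val result = pairwise(rows.flatMap(parse))
result.foreach { case ((l, r), d) => println(s"$l-$r: $d") }
```

For n users this produces n * (n - 1) / 2 pairs instead of n * n, which is the saving described above. Returning the distances as values, rather than printing inside the function, also makes them easy to collect or test.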