Calculating cosine similarity by featurizing the text into vectors using TF-IDF

Submitted by 烂漫一生 on 2019-12-06 09:19:50

Question


I'm new to Apache Spark and want to find similar text within a bunch of texts. I have tried the following myself.

I have two RDDs.

The 1st RDD contains incomplete addresses, as follows:

[0,541 Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans,LA, 70116]
[2,#69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls]
[4,56 E Morehead Webb, TX, 78045]

The 2nd RDD contains the correct addresses, as follows:

[0,541 Jefferson Avenue, Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans, Orleans, LA, 70116]
[2,25 E 75th St #69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls, Geauga, OH, 44023]
[4,56 E Morehead St, Laredo, Webb, TX, 78045]

I have written the code below, but it is taking a lot of time. Can anyone please tell me the correct way of doing this in Apache Spark using Scala?

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val incorrect_address_count = incorrect_address_rdd.count()

// Featurize both sets of addresses together so they share the same hash space
val all_address = incorrect_address_rdd.union(correct_address_rdd)
  .map(_._2.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf = hashingTF.transform(all_address).zipWithIndex()

// First part of the union: the incomplete addresses
val input_vector_rdd = tf.filter(_._2 < incorrect_address_count)

// Second part: the correct addresses, re-keyed back to their original ids
val address_db_vector_rdd = tf.filter(_._2 >= incorrect_address_count)
  .map(f => (f._2 - incorrect_address_count, f._1))
  .join(correct_address_rdd)
  .map(f => (f._2._1, f._2._2))

// Compare every incomplete address against every correct address
val input_similarity_rdd = input_vector_rdd.cartesian(address_db_vector_rdd)
  .map(f => {
    val cosine_similarity = cosineSimilarity(f._1._1.toDense, f._2._1.toDense)
    (f._1._2, cosine_similarity, f._2._2)
  })


def cosineSimilarity(vectorA: Vector, vectorB: Vector): Double = {
  var dotProduct = 0.0
  var normA = 0.0
  var normB = 0.0

  // Dense iteration over every dimension of the hash space
  for (i <- 0 until vectorA.size) {
    dotProduct += vectorA(i) * vectorB(i)
    normA += Math.pow(vectorA(i), 2)
    normB += Math.pow(vectorB(i), 2)
  }
  dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}
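
As background to the pipeline above: `HashingTF` maps each token to a column index by hashing it into a fixed number of buckets and counting occurrences. A minimal pure-Scala sketch of that idea (the hash function and default size here are illustrative assumptions, not Spark's exact implementation):

```scala
// Illustrative sketch of the hashing-TF idea (not Spark's exact hash function):
// each term is hashed into one of numFeatures buckets, and the bucket counts
// form the term-frequency vector, stored sparsely as index -> count.
def hashingTermFrequency(terms: Seq[String], numFeatures: Int = 1 << 20): Map[Int, Double] = {
  terms
    .map(t => ((t.hashCode % numFeatures) + numFeatures) % numFeatures) // non-negative bucket
    .groupBy(identity)
    .map { case (idx, hits) => idx -> hits.size.toDouble }
}

val tf = hashingTermFrequency(Seq("541", "Suite", "204", "Suite"))
// "Suite" appears twice, so one bucket gets a count of 2
```

Because both RDDs are transformed with the same `HashingTF` instance, identical tokens in the incomplete and correct addresses land in the same bucket, which is what makes the cosine comparison meaningful.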

Answer 1:


I had nearly the same problem: 370K rows, and for each row two vectors of 300K and 400K features. I was multiplying the test RDD rows with both of these vectors.

There are two big improvements you can make. The first is to pre-calculate the norms — they do not change. The second is to use sparse vectors: if you iterate with `vector.size`, you loop over all 300K dimensions, whereas with a sparse vector you only iterate over the keywords actually present (20–30 per row).

Also, I am afraid this is already close to the most efficient way, because the calculations do not need to shuffle. If you have a good estimate of which score is enough for you, you can filter by score at the end and things will be fast.

import org.apache.spark.mllib.linalg.SparseVector

def cosineSimilarity(vectorA: SparseVector, vectorB: SparseVector,
                     normASqrt: Double, normBSqrt: Double): (Double, Double) = {
  var dotProduct = 0.0
  // Only iterate over the indices actually present in vectorA
  for (i <- vectorA.indices) {
    dotProduct += vectorA(i) * vectorB(i)
  }
  val div = normASqrt * normBSqrt
  if (div == 0)
    (dotProduct, 0)
  else
    (dotProduct, dotProduct / div)
}
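
Putting the answer's two suggestions together — sparse iteration and precomputed norms — here is a minimal, self-contained sketch. Plain `Map`s stand in for Spark's `SparseVector`, and the names are illustrative, not from the original code:

```scala
// Sparse vectors modelled as index -> value maps; norms computed once and reused.
type SparseVec = Map[Int, Double]

def norm(v: SparseVec): Double =
  math.sqrt(v.values.map(x => x * x).sum)

// The dot product only walks the indices of the smaller vector, so the cost is
// proportional to the number of non-zeros, not the size of the hash space.
def cosineSimilarity(a: SparseVec, b: SparseVec, normA: Double, normB: Double): Double = {
  val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
  val dot = small.iterator.map { case (i, x) => x * large.getOrElse(i, 0.0) }.sum
  val div = normA * normB
  if (div == 0) 0.0 else dot / div
}

val a: SparseVec = Map(1 -> 1.0, 5 -> 2.0)
val b: SparseVec = Map(1 -> 1.0, 5 -> 2.0, 9 -> 3.0)
val sim = cosineSimilarity(a, b, norm(a), norm(b))
```

In the Spark pipeline, `norm` would be computed once per address (e.g. in a `map` before the `cartesian`) and carried along with the vector, so each of the N×M similarity computations avoids recomputing it.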


Source: https://stackoverflow.com/questions/32645231/calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf
