Calculating cosine similarity by featurizing the text into vectors using TF-IDF

Submitted by 烂漫一生 on 2019-12-06 09:19:50

Question


I'm new to Apache Spark and want to find similar text within a bunch of texts. I have tried the following myself.

I have two RDDs.

The 1st RDD contains incomplete addresses, as follows:

[0,541 Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans,LA, 70116]
[2,#69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls]
[4,56 E Morehead Webb, TX, 78045]

The 2nd RDD contains the correct addresses, as follows:

[0,541 Jefferson Avenue, Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans, Orleans, LA, 70116]
[2,25 E 75th St #69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls, Geauga, OH, 44023]
[4,56 E Morehead St, Laredo, Webb, TX, 78045]

I have written the code below, but it is taking a lot of time. Can anyone please tell me the correct way of doing this in Apache Spark using Scala?

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val incorrect_address_count = incorrect_address_rdd.count()

// Featurize both sets of addresses together so they share the same hash space
val all_address = incorrect_address_rdd.union(correct_address_rdd)
  .map(_._2.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf = hashingTF.transform(all_address).zipWithIndex()

// First part of the union: the incomplete addresses
val input_vector_rdd = tf.filter(_._2 < incorrect_address_count)

// Second part: the correct addresses, re-keyed back to their original ids
val address_db_vector_rdd = tf.filter(_._2 >= incorrect_address_count)
  .map(f => (f._2 - incorrect_address_count, f._1))
  .join(correct_address_rdd)
  .map(f => (f._2._1, f._2._2))

// Compare every incomplete address against every correct address
val input_similarity_rdd = input_vector_rdd.cartesian(address_db_vector_rdd)
  .map(f => {
    val cosine_similarity = cosineSimilarity(f._1._1.toDense, f._2._1.toDense)
    (f._1._2, cosine_similarity, f._2._2)
  })


def cosineSimilarity(vectorA: Vector, vectorB: Vector): Double = {
  var dotProduct = 0.0
  var normA = 0.0
  var normB = 0.0

  // Dense iteration over every dimension of the hash space
  for (i <- 0 until vectorA.size) {
    dotProduct += vectorA(i) * vectorB(i)
    normA += Math.pow(vectorA(i), 2)
    normB += Math.pow(vectorB(i), 2)
  }
  dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}
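
As background to the pipeline above: `HashingTF` maps each token to a column index by hashing it into a fixed number of buckets and counting occurrences. A minimal pure-Scala sketch of that idea (the hash function and default size here are illustrative assumptions, not Spark's exact implementation):

```scala
// Illustrative sketch of the hashing-TF idea (not Spark's exact hash function):
// each term is hashed into one of numFeatures buckets, and the bucket counts
// form the term-frequency vector, stored sparsely as index -> count.
def hashingTermFrequency(terms: Seq[String], numFeatures: Int = 1 << 20): Map[Int, Double] = {
  terms
    .map(t => ((t.hashCode % numFeatures) + numFeatures) % numFeatures) // non-negative bucket
    .groupBy(identity)
    .map { case (idx, hits) => idx -> hits.size.toDouble }
}

val tf = hashingTermFrequency(Seq("541", "Suite", "204", "Suite"))
// "Suite" appears twice, so one bucket gets a count of 2
```

Because both RDDs are transformed with the same `HashingTF` instance, identical tokens in the incomplete and correct addresses land in the same bucket, which is what makes the cosine comparison meaningful.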

Answer 1:


I had nearly the same problem: 370K rows, and for each row two vectors of 300K and 400K features. I was multiplying the test RDD rows with both of these vectors.

There are two big improvements you can make. The first is to pre-calculate the norms — they do not change. The second is to use sparse vectors: if you iterate with `vector.size`, you loop over all 300K dimensions, whereas with a sparse vector you only iterate over the keywords actually present (20–30 per row).

Also, I am afraid this is already close to the most efficient way, because the calculations do not need to shuffle. If you have a good estimate of which score is enough for you, you can filter by score at the end and things will be fast.

import org.apache.spark.mllib.linalg.SparseVector

def cosineSimilarity(vectorA: SparseVector, vectorB: SparseVector,
                     normASqrt: Double, normBSqrt: Double): (Double, Double) = {
  var dotProduct = 0.0
  // Only iterate over the indices actually present in vectorA
  for (i <- vectorA.indices) {
    dotProduct += vectorA(i) * vectorB(i)
  }
  val div = normASqrt * normBSqrt
  if (div == 0)
    (dotProduct, 0)
  else
    (dotProduct, dotProduct / div)
}
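
Putting the answer's two suggestions together — sparse iteration and precomputed norms — here is a minimal, self-contained sketch. Plain `Map`s stand in for Spark's `SparseVector`, and the names are illustrative, not from the original code:

```scala
// Sparse vectors modelled as index -> value maps; norms computed once and reused.
type SparseVec = Map[Int, Double]

def norm(v: SparseVec): Double =
  math.sqrt(v.values.map(x => x * x).sum)

// The dot product only walks the indices of the smaller vector, so the cost is
// proportional to the number of non-zeros, not the size of the hash space.
def cosineSimilarity(a: SparseVec, b: SparseVec, normA: Double, normB: Double): Double = {
  val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
  val dot = small.iterator.map { case (i, x) => x * large.getOrElse(i, 0.0) }.sum
  val div = normA * normB
  if (div == 0) 0.0 else dot / div
}

val a: SparseVec = Map(1 -> 1.0, 5 -> 2.0)
val b: SparseVec = Map(1 -> 1.0, 5 -> 2.0, 9 -> 3.0)
val sim = cosineSimilarity(a, b, norm(a), norm(b))
```

In the Spark pipeline, `norm` would be computed once per address (e.g. in a `map` before the `cartesian`) and carried along with the vector, so each of the N×M similarity computations avoids recomputing it.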


Source: https://stackoverflow.com/questions/32645231/calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf
