Question
Is it possible for Spark's cosine similarity to work with sparse input data? I have seen examples where the input consists of lines of space-separated features of the form:
id feat1 feat2 feat3 ...
but I have an inherently sparse, implicit feedback setting and would like to have input in the form:
id1 feat1:1 feat5:1 feat10:1
id2 feat3:1 feat5:1 ..
...
I would like to exploit the sparsity to speed up the calculation. Ultimately, I also want to use the DIMSUM algorithm for all-pairs similarity, which was recently incorporated into Spark. Could someone suggest a sparse input format that would work with DIMSUM on Spark? I checked the example code, and a comment there says "The input must be a dense matrix", but since this code is in the examples directory I don't know whether that applies only to this particular case.
spark/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
That's the path to the example code that I'm referring to.
Just a couple of lines representing how the sparse-input format should look (from a recommendation system perspective, user_id feat1:1 feat2:1 ...), to work with cosine similarity, would be extremely helpful.
Also would it be okay if I left the user_ids as strings?
I am aware that libsvm format is similar but there is no notion of a user id in this case, only input instances with features so I was wondering how the libsvm format would translate into a recommendation system domain?
My apologies for the extremely simplistic questions, I am extremely new to Spark and am just getting my feet wet.
Any help would be much appreciated. Thanks in advance!
Answer 1:
Why not? A naive solution could look more or less like this:
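// Parse one input line of the form "id feat1:1 feat5:1 ..."
def parseLine(line: String): (String, Map[String, Double]) = {
  def parseFeature(feature: String): (String, Double) =
    feature.split(":") match {
      case Array(k, v) => (k, v.toDouble)
      case _ => throw new IllegalArgumentException(s"Malformed feature: $feature")
    }

  val bits = line.split(" ")
  val id = bits.head
  val features = bits.tail.map(parseFeature).toMap
  (id, features)
}

// Compute the dot product between two dicts;
// only keys present in both maps contribute
def dotProduct(x: Map[String, Double], y: Map[String, Double]): Double = {
  val (small, large) = if (x.size <= y.size) (x, y) else (y, x)
  small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
}

// Compute the Euclidean norm of a dict
def norm(x: Map[String, Double]): Double =
  math.sqrt(x.values.map(v => v * v).sum)

// Compute cosine similarity (NaN if either dict is empty)
def sparseCosine(x: Map[String, Double], y: Map[String, Double]): Double = {
  dotProduct(x, y) / (norm(x) * norm(y))
}

// Parse input lines
val parsed = sc.textFile("features.txt").map(parseLine)

// Find unique pairs: keep (a, b) but drop (b, a) and (a, a)
val pairs = parsed.cartesian(parsed).filter { case ((k1, _), (k2, _)) => k1 < k2 }

// Compute cosine similarity between pairs
val similarities = pairs.map { case ((k1, m1), (k2, m2)) => ((k1, k2), sparseCosine(m1, m2)) }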
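On the DIMSUM part of the question: as far as I know, MLlib's DIMSUM entry point is `RowMatrix.columnSimilarities`, and it works on vectors addressed by integer indices, so string ids and feature names would first need to be mapped to integers. Below is a minimal sketch of that mapping in plain Scala (no Spark needed to run it). The object `SparseIndexing` and its methods are hypothetical names made up for illustration, and the sketch assumes each feature appears at most once per line.

```scala
// Hypothetical helper object; these names are illustrative, not part of Spark.
object SparseIndexing {
  // Parse "id feat:value feat:value ..." into (id, list of (featureName, value)).
  def parseLine(line: String): (String, List[(String, Double)]) = {
    val bits = line.trim.split("\\s+")
    val feats = bits.tail.toList.map { f =>
      f.split(":") match {
        case Array(k, v) => (k, v.toDouble)
        case _ => throw new IllegalArgumentException(s"Malformed feature: $f")
      }
    }
    (bits.head, feats)
  }

  // Assign each distinct name a 0-based index, in lexicographic order.
  def buildIndex(names: Iterable[String]): Map[String, Int] =
    names.toSeq.distinct.sorted.zipWithIndex.toMap

  // Convert one parsed row into parallel (indices, values) arrays,
  // sorted by index (assumes no feature is repeated within a row).
  def toSparse(feats: List[(String, Double)],
               featIndex: Map[String, Int]): (Array[Int], Array[Double]) = {
    val sorted = feats.map { case (k, v) => (featIndex(k), v) }.sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}

// Usage on the two sample lines from the question
val rows = List("id1 feat1:1 feat5:1 feat10:1", "id2 feat3:1 feat5:1")
  .map(SparseIndexing.parseLine)
val featIndex = SparseIndexing.buildIndex(rows.flatMap(_._2.map(_._1)))
val (indices, values) = SparseIndexing.toSparse(rows.head._2, featIndex)
```

The `(indices, values)` pair, together with `featIndex.size`, has the shape that `Vectors.sparse(size, indices, values)` expects, so the string ids can stay strings as long as you keep a lookup table next to the integer indices. One caveat, to the best of my knowledge: `columnSimilarities` compares the columns of the matrix, so for user-to-user similarity the users would have to be laid out as columns (one row per feature), and the same `buildIndex` helper could be applied to the user ids.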
Source: https://stackoverflow.com/questions/30060206/spark-cosine-similarity-dimsum-algorithm-sparse-input-file