Question
I have been implementing the TF-IDF method described here with Python/PySpark, using the feature module from MLlib:
https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
I have a training set of 150 text documents and a testing set of 80 text documents. I have produced a hashed TF-IDF RDD of sparse vectors (i.e. a bag-of-words representation) for each, called tfidf_train and tfidf_test. The IDF is shared between the two and is based solely on the training data. My question concerns how to handle sparse RDDs; there is very little information out there.
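For reference, I built the two RDDs roughly like this, following the linked guide (train_docs and test_docs stand for my RDDs of tokenized documents; the exact names are placeholders):

    from pyspark.mllib.feature import HashingTF, IDF

    # Hash tokens into a 2^20-dimensional term-frequency space
    # (HashingTF's default, hence the 1048576 dimension shown below)
    hashingTF = HashingTF()
    tf_train = hashingTF.transform(train_docs)  # train_docs: RDD of token lists
    tf_test = hashingTF.transform(test_docs)

    tf_train.cache()
    idf = IDF().fit(tf_train)                   # IDF fitted on training data only
    tfidf_train = idf.transform(tf_train)
    tfidf_test = idf.transform(tf_test)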
I would now like to efficiently map each of the 80 test-document TF-IDF vectors to the training TF-IDF vector with which it shares the highest cosine similarity. Calling tfidf_test.first(), I see that each sparse TF-IDF vector (in both RDDs) looks something like this:
SparseVector(1048576, {0: 15.2313, 9377: 8.6483, 16538: 4.3241, 45005: 4.3241, 67046: 5.0173, 80280: 4.3241, 83104: 2.9378, 83107: 3.0714, 87638: 3.9187, 90331: 3.9187, 110522: 1.7592, 138394: 3.631, 140318: 4.3241, 147576: 4.3241, 165673: 4.3241, 172912: 3.9187, 179664: 4.3241, 179767: 5.0173, 189356: 1.047, 190616: 4.3241, 192712: 4.3241, 193790: 3.4078, 220545: 3.9187, 221050: 3.4078, 229110: 3.4078, 232286: 2.0728, 240477: 3.631, 241582: 4.3241, 242620: 3.9187, 245388: 5.0173, 252569: 2.8201, 255985: 5.0173, 266130: 4.3241, 277170: 3.9187, 277863: 4.3241, 298406: 4.3241, 323505: 4.3241, 326993: 3.2255, 330297: 4.3241, 334392: 3.4078, 354917: 3.631, 355604: 3.9187, 365855: 4.3241, 383386: 2.9378, 386534: 4.3241, 387896: 3.2255, 392015: 4.3241, 395372: 1.4619, 406995: 3.4078, 414351: 5.0173, 433323: 4.3241, 434512: 4.3241, 438171: 4.3241, 439468: 4.3241, 453414: 3.9187, 454316: 4.3241, 456931: 3.9187, 461229: 3.631, 488050: 5.0173, 506649: 4.3241, 508845: 3.0714, 512698: 4.3241, 526484: 8.6483, 548929: 2.8201, 549530: 4.3241, 550044: 3.631, 555900: 4.3241, 557206: 6.451, 570917: 1.8392, 618498: 3.4078, 623040: 3.5968, 637333: 4.3241, 645028: 2.9378, 669449: 3.0714, 676506: 4.3241, 699388: 4.3241, 702049: 2.3782, 715677: 3.4078, 733071: 3.9187, 738831: 3.631, 743497: 8.6483, 782907: 1.047, 793071: 4.3241, 801052: 4.3241, 805189: 3.2255, 811506: 4.3241, 812013: 4.3241, 819994: 4.3241, 837270: 4.3241, 848755: 3.9187, 852042: 4.3241, 866553: 4.3241, 872996: 3.2255, 908183: 5.0173, 914226: 8.6483, 921216: 4.3241, 925934: 4.3241, 927892: 4.3241, 935542: 5.0173, 941563: 1.0855, 958430: 3.4078, 959994: 1.7984, 977239: 3.9187, 978895: 3.0714, 1001818: 3.2255, 1002343: 3.2255, 1016145: 4.3241, 1017725: 4.3241, 1031685: 8.1441})
I am unsure how to compare between the two RDDs, but I think reduceByKey(lambda x, y: x * y) may be useful. Does anyone have any ideas how to scan each test vector across the training set and output a tuple of (best-matching training vector, cosine similarity value)?
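For concreteness, here is the kind of brute-force pairing I have in mind (an untested sketch; cosine_similarity, test_kv, train_kv, and best are names I made up, and I am assuming SparseVector.dot is usable here):

    import math

    def cosine_similarity(a, b):
        # SparseVector.dot only touches the shared non-zero indices,
        # so this stays cheap despite the 2^20-dimensional space
        denom = math.sqrt(a.dot(a)) * math.sqrt(b.dot(b))
        return a.dot(b) / denom if denom else 0.0

    # Key each vector by a document index so matches are identifiable
    test_kv = tfidf_test.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
    train_kv = tfidf_train.zipWithIndex().map(lambda vi: (vi[1], vi[0]))

    # Brute force: 80 x 150 = 12,000 pairs, then keep the best-scoring
    # training vector per test document
    best = (test_kv.cartesian(train_kv)
            .map(lambda p: (p[0][0], (p[1][0],
                            cosine_similarity(p[0][1], p[1][1]))))
            .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))
    # best: RDD of (test_doc_id, (train_doc_id, cosine_similarity))

Is something like this reasonable, or is there a more idiomatic way that avoids the full cartesian product?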
Any help appreciated!
Source: https://stackoverflow.com/questions/31946507/sparse-vector-rdd-in-pyspark