Question
As part of a recommender system that I am building, I want to implement item-item recommendation based on cosine similarity. Ideally, I would like to compute the cosine similarity between 1 million items, each represented by a DenseVector of 2048 features, in order to get the top-n items most similar to a given one.
My problem is that the solutions I've come across perform poorly on my dataset.
I've tried:
- Calculating the cosine similarity between all the rows of a dataframe in pyspark
- Using columnSimilarities() from mllib.linalg.distributed
- Reducing dimensionality with PCA
Here is my attempt using columnSimilarities():
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.getOrCreate()

# url_rdd (item identifiers) and vector_rdd (2048 raw features per item)
# are built earlier in the pipeline
new_df = url_rdd.zip(vector_rdd.map(lambda x: Vectors.dense(x))).toDF(schema=['url', 'features'])
# Reduce the feature vectors from 2048 to 1024 dimensions with PCA
pca = PCA(k=1024, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(new_df)
pca_df = pca_model.transform(new_df)
# Assign a contiguous 0-based index to each row
pca_df.createOrReplaceTempView('pca_df')
indexed_df = spark.sql('select row_number() over (order by url) - 1 as id, * from pca_df')
# Transpose so that the items end up in the columns: columnSimilarities()
# computes column-column cosine similarities
mat = IndexedRowMatrix(
    indexed_df.select("id", "pca_features")
              .rdd.map(lambda row: IndexedRow(row.id, row.pca_features.toArray()))
).toBlockMatrix().transpose().toIndexedRowMatrix()
cos_mat = mat.columnSimilarities()  # upper-triangular CoordinateMatrix
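To then read the top-n most similar items per id out of cos_mat, I do something along these lines (a sketch: n = 10 is an arbitrary choice, and the mirroring step is needed because columnSimilarities() only returns the upper triangle of the similarity matrix):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# cos_mat.entries is an RDD of MatrixEntry(i, j, value); mirror every
# entry so that each pair appears in both directions
sims = cos_mat.entries.map(lambda e: (e.i, e.j, e.value)).toDF(['i', 'j', 'sim'])
sims = sims.union(sims.select(F.col('j').alias('i'), F.col('i').alias('j'), F.col('sim')))

# Rank neighbours per item and keep the n best; joining back to
# indexed_df on the id column recovers the urls
n = 10
w = Window.partitionBy('i').orderBy(F.col('sim').desc())
top_n = sims.withColumn('rank', F.row_number().over(w)).filter(F.col('rank') <= n)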
Is there a better way in PySpark to compute the cosine similarity and get the top-n most similar items?
Answer 1:
Consider caching new_df, as you're going over it at least twice (once to fit the model, another time to transform the data). Additionally, don't forget about the optional threshold you can pass to the columnSimilarities method.
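A minimal sketch of both suggestions, reusing new_df and mat from the question (the 0.1 threshold is just an example value; the threshold parameter is exposed on RowMatrix, hence the toRowMatrix() call):

# Cache the DataFrame that is traversed twice (pca.fit and pca_model.transform)
new_df = new_df.cache()

# With a positive threshold, similarities are estimated by sampling (DIMSUM);
# pairs likely to fall below the threshold are skipped, which can cut the
# computation drastically at the cost of an approximate result
cos_mat = mat.toRowMatrix().columnSimilarities(threshold=0.1)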
Source: https://stackoverflow.com/questions/55747128/item-item-recommendation-based-on-cosine-similarity