Item-item recommendation based on cosine similarity

Submitted by こ雲淡風輕ζ on 2020-12-07 07:17:42

Question


As part of a recommender system that I am building, I want to implement item-item recommendation based on cosine similarity. Ideally, I would like to compute the cosine similarity over 1 million items, each represented by a DenseVector of 2048 features, in order to get the top-n most similar items to a given one.
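For reference, the computation being asked for can be sketched on a single machine in plain Python. This is only an illustration of cosine similarity and top-n selection, not a scalable solution; the `items` dictionary and the function names are hypothetical and do not come from the question.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_n_similar(items, query_id, n):
    """Return the n (item_id, similarity) pairs most similar to query_id.

    `items` is a hypothetical dict mapping item id -> feature vector.
    """
    query = items[query_id]
    scored = (
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in items.items()
        if other_id != query_id
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy data standing in for the 2048-dimensional vectors
items = {
    "a": [1.0, 0.0],
    "b": [2.0, 0.0],  # same direction as "a"
    "c": [1.0, 1.0],  # 45 degrees from "a"
    "d": [0.0, 3.0],  # orthogonal to "a"
}
result = top_n_similar(items, "a", 2)
```

Computing this for every pair requires on the order of n² comparisons, which is why the brute-force approach becomes expensive at 1 million items.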

My problem is that the solutions I've come across perform poorly on my dataset.

I've tried:

  • Calculating the cosine similarity between all the rows of a dataframe in pyspark

  • Using columnSimilarities() from mllib.linalg.distributed

  • Reducing dimensionality with PCA

Here is the solution using columnSimilarities():

from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.getOrCreate()

# url_rdd (item URLs) and vector_rdd (raw feature arrays) are assumed to exist already
new_df = url_rdd.zip(vector_rdd.map(lambda x: Vectors.dense(x))).toDF(schema=['url', 'features'])

# PCA
pca = PCA(k=1024, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(new_df)
pca_df = pca_model.transform(new_df)

# Indexing my dataframe
pca_df.createOrReplaceTempView('pca_df')
indexed_df = spark.sql('select row_number() over (order by url) - 1 as id, * from pca_df')

# Computing Cosine Similarity
# columnSimilarities() compares columns, so the matrix is transposed to put items in columns
rows = indexed_df.select("id", "pca_features").rdd.map(
    lambda row: IndexedRow(row.id, row.pca_features.toArray()))
mat = IndexedRowMatrix(rows).toBlockMatrix().transpose().toIndexedRowMatrix()
cos_mat = mat.columnSimilarities()
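The result of columnSimilarities() is a sparse upper-triangular matrix, so each pair of items appears only once, with the smaller index first. Getting the top-n neighbours of one item therefore means looking at both index positions. The sketch below shows that selection logic in plain Python on `(i, j, similarity)` tuples; the function name and the toy entries are hypothetical. In Spark, the same condition would presumably be applied as a filter on `cos_mat.entries` (whose elements expose `i`, `j` and `value`) before collecting anything to the driver.

```python
import heapq


def top_n_from_entries(entries, item, n):
    """Top-n neighbours of `item` from upper-triangular (i, j, similarity) entries."""
    neighbours = []
    for i, j, sim in entries:
        if i == item:
            neighbours.append((j, sim))  # item is the row index of the pair
        elif j == item:
            neighbours.append((i, sim))  # item is the column index of the pair
    return heapq.nlargest(n, neighbours, key=lambda pair: pair[1])


# Toy upper-triangular entries (i < j) for four items
entries = [(0, 1, 0.9), (0, 2, 0.1), (1, 2, 0.5), (1, 3, 0.7), (2, 3, 0.3)]
top = top_n_from_entries(entries, 1, 2)
```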

Is there a better solution in PySpark to compute the cosine similarity and get the top-n most similar items?


Answer 1:


Consider caching new_df, as you're going over it at least twice (once to fit a model, another time to transform the data).

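Additionally, don't forget about the optional threshold you can pass to the columnSimilarities method: a positive threshold switches from the brute-force computation to a sampling approach that costs less, at the price of the similarities being estimates.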



Source: https://stackoverflow.com/questions/55747128/item-item-recommendation-based-on-cosine-similarity
