Question
This is how the user-item-rating list looks as a pandas DataFrame:
   item_id  rating user_id
0  aaaaaaa       5       X
1  bbbbbbb       2       Y
2  ccccccc       5       Z
3  ddddddd       1       T
This is how I create the user-item matrix in pandas, and it only takes a couple of seconds with the real dataset (about 500k rows):
user_item_matrix = df.pivot(index = 'user_id', columns ='item_id', values = 'rating')
item_id  aaaaaaa  bbbbbbb  ccccccc  ddddddd
user_id
T            NaN      NaN      NaN      1.0
X            5.0      NaN      NaN      NaN
Y            NaN      2.0      NaN      NaN
Z            NaN      NaN      5.0      NaN
I am trying this approach to achieve the same result with a PySpark DataFrame:
from pyspark.sql.functions import first
df.groupby('user_id') \
  .pivot('item_id') \
  .agg(first('rating'))
But it takes ages to complete with the real data. Is there a smarter/faster way to achieve this? Basically, I am trying to build a user-item matrix from a user-item-rating list.
Answer 1:
This is an alternative, RDD-based approach.
rating_list = [['aaa',5.0,'T'],['bbb',5.0,'U'],['ccc',5.0,'V'],['ddd',5.0,'W'],['eee',5.0,'X']]
df = sc.parallelize(rating_list).toDF(['item_id','rating','user_id'])
df.show()
+-------+------+-------+
|item_id|rating|user_id|
+-------+------+-------+
| aaa| 5.0| T|
| bbb| 5.0| U|
| ccc| 5.0| V|
| ddd| 5.0| W|
| eee| 5.0| X|
+-------+------+-------+
# collect the distinct item ids; these become the matrix columns
# (sorted so the column order is deterministic)
items = sorted(df.select('item_id').distinct().rdd.map(lambda data: data.item_id).collect())
item_len = len(items)

def transformRating(item_id, rating, user_id):
    # build one rating vector per input row: the rated item's slot is
    # filled, every other slot is None
    rating_list = [rating if ele == item_id else None for ele in items]
    return [user_id] + rating_list

# note: the Python 2 tuple-unpacking lambda (lambda (item, rat, uid): ...)
# is a syntax error in Python 3, so unpack the Row fields directly
df1 = (df.rdd
       .map(lambda data: transformRating(data.item_id, data.rating, data.user_id))
       .toDF(['uid'] + items))
df1.show()
+---+----+----+----+----+----+
|uid| aaa| bbb| ccc| ddd| eee|
+---+----+----+----+----+----+
| T| 5.0|null|null|null|null|
| U|null| 5.0|null|null|null|
| V|null|null| 5.0|null|null|
| W|null|null|null| 5.0|null|
| X|null|null|null|null| 5.0|
+---+----+----+----+----+----+
Now I would assume that one user might rate multiple items. In that case you need to reduce the RDD by user_id and combine the ratings: it is just one more reduceByKey statement before .toDF, and you get a DataFrame of the same shape.
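The combine step that this extra reduceByKey would perform can be sketched in plain Python (the function name merge_rating_rows is my own, not part of the answer): two partially filled rating vectors for the same user are merged element-wise, keeping whichever slot is non-None.

def merge_rating_rows(a, b):
    # keep the filled slot from either vector; if both rows somehow
    # rated the same item, the first value wins
    return [x if x is not None else y for x, y in zip(a, b)]

# user T rated 'aaa' and 'ccc' in two separate input rows,
# over the item columns ['aaa', 'bbb', 'ccc']
row1 = [5.0, None, None]
row2 = [None, None, 1.0]
print(merge_rating_rows(row1, row2))  # -> [5.0, None, 1.0]

In the RDD pipeline above, the idea would be to map each row to a (user_id, rating_vector) pair, call .reduceByKey(merge_rating_rows), and then flatten back to [uid] + vector before .toDF — a sketch of the suggestion, not tested against Spark here.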
Source: https://stackoverflow.com/questions/44594456/converting-user-item-rating-list-to-user-item-matrix-with-pyspark