Question
This is how the user-item-rating list looks as a pandas DataFrame:
   item_id  rating user_id
0  aaaaaaa       5       X
1  bbbbbbb       2       Y
2  ccccccc       5       Z
3  ddddddd       1       T
This is how I create the user-item matrix in pandas, and it only takes a couple of seconds with the real dataset (about 500k rows):
user_item_matrix = df.pivot(index = 'user_id', columns ='item_id', values = 'rating')
item_id  aaaaaaa  bbbbbbb  ccccccc  ddddddd
user_id
T            NaN      NaN      NaN      1.0
X            5.0      NaN      NaN      NaN
Y            NaN      2.0      NaN      NaN
Z            NaN      NaN      5.0      NaN
I am trying this approach to achieve the same result with a PySpark DataFrame:
from pyspark.sql.functions import first
df.groupby('user_id') \
  .pivot('item_id') \
  .agg(first('rating'))
But it takes ages to complete with the real data. Is there a smarter/faster way to achieve this? Basically, I am trying to build a user-item matrix from a user-item-rating list.
Answer 1:
This is an alternative, RDD-based approach.
rating_list = [['aaa',5.0,'T'],['bbb',5.0,'U'],['ccc',5.0,'V'],['ddd',5.0,'W'],['eee',5.0,'X']]
df = sc.parallelize(rating_list).toDF(['item_id','rating','user_id'])
df.show()
+-------+------+-------+
|item_id|rating|user_id|
+-------+------+-------+
| aaa| 5.0| T|
| bbb| 5.0| U|
| ccc| 5.0| V|
| ddd| 5.0| W|
| eee| 5.0| X|
+-------+------+-------+
# collect the distinct item ids; these become the matrix columns
# (sorted so the column order is deterministic)
items = sorted(df.select('item_id').distinct().rdd.map(lambda data: data.item_id).collect())
item_len = len(items)

def transformRating(item_id, rating, user_id):
    # build one rating vector per input row: the rated item's slot is
    # filled, every other slot is None
    rating_list = [rating if ele == item_id else None for ele in items]
    return [user_id] + rating_list

# note: the Python 2 tuple-unpacking lambda (lambda (item, rat, uid): ...)
# is a syntax error in Python 3, so unpack the Row fields directly
df1 = (df.rdd
       .map(lambda data: transformRating(data.item_id, data.rating, data.user_id))
       .toDF(['uid'] + items))
df1.show()
+---+----+----+----+----+----+
|uid| aaa| bbb| ccc| ddd| eee|
+---+----+----+----+----+----+
| T| 5.0|null|null|null|null|
| U|null| 5.0|null|null|null|
| V|null|null| 5.0|null|null|
| W|null|null|null| 5.0|null|
| X|null|null|null|null| 5.0|
+---+----+----+----+----+----+
Now I would assume that one user might rate multiple items. In that case you need to reduce the RDD by user_id and combine the ratings: it is just one more reduceByKey statement before .toDF, and you get a DataFrame of the same shape.
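The combine step that this extra reduceByKey would perform can be sketched in plain Python (the function name merge_rating_rows is my own, not part of the answer): two partially filled rating vectors for the same user are merged element-wise, keeping whichever slot is non-None.

def merge_rating_rows(a, b):
    # keep the filled slot from either vector; if both rows somehow
    # rated the same item, the first value wins
    return [x if x is not None else y for x, y in zip(a, b)]

# user T rated 'aaa' and 'ccc' in two separate input rows,
# over the item columns ['aaa', 'bbb', 'ccc']
row1 = [5.0, None, None]
row2 = [None, None, 1.0]
print(merge_rating_rows(row1, row2))  # -> [5.0, None, 1.0]

In the RDD pipeline above, the idea would be to map each row to a (user_id, rating_vector) pair, call .reduceByKey(merge_rating_rows), and then flatten back to [uid] + vector before .toDF — a sketch of the suggestion, not tested against Spark here.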
Source: https://stackoverflow.com/questions/44594456/converting-user-item-rating-list-to-user-item-matrix-with-pyspark