Apache Spark ALS collaborative filtering results. They don't make sense

∥☆過路亽.° submitted on 2019-12-03 07:57:33

Note that the code you are running does not use implicit feedback, so it is not quite the algorithm you refer to. Just make sure you are not using ALS.trainImplicit. You may need a different lambda and rank. An RMSE of 0.88 is "OK" for this data set; it is not clear to me whether the example's values are optimal or just the ones that the toy test produced. You use yet another value here. Maybe it's just not optimal yet.
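As a rough illustration of how one might search for a better lambda and rank, here is a minimal sketch in plain Python. The `grid_search` helper and its `train_and_score` callback are hypothetical names introduced for this example, not part of Spark; the callback is assumed to train a model for a given rank and lambda and return the RMSE on a held-out validation set.

```python
import math
from itertools import product

def rmse(predictions, actuals):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))

def grid_search(train_and_score, ranks, lambdas):
    """Try every (rank, lambda) pair and return (best_rmse, rank, lambda).

    train_and_score is a hypothetical callback supplied by the caller:
    it trains a model with the given rank and lambda and returns the
    validation RMSE for that model.
    """
    best = None
    for rank, lam in product(ranks, lambdas):
        score = train_and_score(rank, lam)
        if best is None or score < best[0]:
            best = (score, rank, lam)
    return best
```

With Spark, the callback would presumably wrap the explicit-feedback training call (not ALS.trainImplicit) and score the resulting model on ratings that were held out from training.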

It could even be something like a bug in the ALS implementation that has since been fixed. Try comparing to another implementation of ALS if you can.

I always try to resist rationalizing recommendations, since our brains inevitably find some explanation even for random ones. But, hey, I can say that you did not get action, horror, crime drama, or thrillers here. I find that kids' movies go hand in hand with a taste for arty movies: the people who filled out their tastes on MovieLens way back when and rated kids' movies were not actually kids but parents, and software-engineer types old enough to have kids may well tend to watch the sorts of foreign films you see here.

Collaborative filtering just gives you items that people with the same taste as you really like. If you rate only kids' movies, that doesn't mean you will be recommended only kids' movies. It just means that people who rated Toy Story, Jungle Book, Lion King, etc. as you did also like Life of Oharu, More, Who's Singin' Over There?, etc. There is a good animation on the Wikipedia page: CF

I didn't read the link that you gave, but if you want to stay with collaborative filtering, one thing you can change is the similarity measure you are using.

If you want recommendations based on your taste, you might try a latent factor model such as matrix factorization. Here the latent factors might discover features that describe the characteristics of the rated objects, for example that a movie is a comedy, a children's film, a horror film, etc. (You never really know what the latent factors are, by the way.) And if you only rate kids' movies, you might get other kids' movies as recommendations.
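To make the latent factor idea concrete, here is a minimal sketch of matrix factorization in plain Python. Note that it is trained with stochastic gradient descent, not ALS, purely to keep the example short; the function names, hyperparameter defaults, and integer user/item indices are all assumptions made for illustration.

```python
import math
import random

def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and item i's factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def train_rmse(ratings, P, Q):
    """RMSE of the factor model on the (user, item, rating) triples given."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))

def factorize(ratings, n_users, n_items, rank=2, steps=500,
              lr=0.01, reg=0.05, seed=0):
    """Learn user factors P and item factors Q by SGD (illustrative defaults)."""
    rng = random.Random(seed)
    # Small positive random initialisation of both factor matrices.
    P = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularisation.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

Each of the `rank` columns plays the role of one unnamed latent feature; nothing in the model says which one, if any, corresponds to "children's film".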

Hope it helps.

Second what Vlad said: try correlation or Jaccard, i.e. ignore the rating numbers and just look at the binary "are these two movies together in a user's preference list or not". This was a game-changer for me when I was building my first recommender: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
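A minimal sketch of the Jaccard idea in plain Python, assuming ratings arrive as (user, item, rating) triples; the helper names are made up for this example. The rating value is deliberately thrown away, so only co-occurrence in users' lists counts.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity of two user sets: |A & B| / |A | B|."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def item_user_sets(ratings):
    """Map each item to the set of users who rated it (rating value ignored)."""
    items = {}
    for user, item, _rating in ratings:
        items.setdefault(item, set()).add(user)
    return items

def most_similar(target, items, top_n=3):
    """Rank the other items by Jaccard similarity to target, highest first."""
    scores = [(jaccard(items[target], users), other)
              for other, users in items.items() if other != target]
    # Sort by descending score, breaking ties by item name for determinism.
    scores.sort(key=lambda s: (-s[0], str(s[1])))
    return [(other, score) for score, other in scores[:top_n]]
```

For example, `most_similar("Toy Story", item_user_sets(ratings))` would list the movies that most often share a user with Toy Story, regardless of how those users scored them.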

Good luck

Charles O

I tried the same dataset and, following this Spark tutorial, I get the same (subjectively bad) results.

However, using a simpler method instead of matrix factorization, for instance one based on Pearson correlation as a similarity measure, I get much, much better results. That is, with your input preferences and the same input ratings file, I mostly get kids' movies.
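For reference, a Pearson-based item similarity could look roughly like the following plain-Python sketch. It is not the code used for the results above; the nested-dict layout `{user: {item: rating}}` and the function names are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when there are fewer than two points or either
    sequence has zero variance (the correlation is undefined there).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def item_similarity(ratings_by_user, item_a, item_b):
    """Pearson correlation of two items over the users who rated both."""
    common = [u for u, r in ratings_by_user.items()
              if item_a in r and item_b in r]
    return pearson([ratings_by_user[u][item_a] for u in common],
                   [ratings_by_user[u][item_b] for u in common])
```

A recommender built on this would score unseen movies by their similarity to the movies the user rated highly, which keeps the suggestions close to the input preferences.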

Unless you really need the factorization (which has a lot of advantages, though), I would suggest using another recommendation method.
