问题
(This is no homework and no work issue. It's just my personal interest/occupation and completly fictional. But I am interested in a good algorithm or data structure.)
Let's assume, that I would run a dating site. And my special feature would be that the singles were matched by movie taste. (Why not?)
In that case I would need a way to store the movie ratings for each user. (So far no problem.) And I would need a data structure to find the best fitting user. The distance between two taste patterns would be the average distance between all ratings that both users made.
Example
movies A B C D E F G H I J K L M ...
user Xm 9 5 1 1 5
user Ym 4 6 1 8
user Zf 9 6 4 7
Distance(X,Z) = avg( abs(9-9) + abs(1-4) ) = 1.5
Distance(Y,Z) = avg( abs(4-6) + abs(6-4) + abs(8-7) ) = 1.666
So Mr. X fits slightly better to Mrs. Z, than Mr. Y does.
I like soulution that ...
- ... don't need many operations on the database
- ... don't need to handle a lot of data
- ... run fast
- ... deliver the best matching
- Ok, maybe I would consider good approximations too.
Try to keep in mind that this should also work with thousands of possible movies, users that rate only about 20-50 movies, and thousands of users.
(Because this is a mental puzzle and not a real problem, work-arrounds are not really helping.)
What would be your search algorithm or data structure?
回答1:
Sounds a lot like the Netflix Prize challenge, more specifically the first half of the most popular approach. The possible implementations of what you are trying to do are numerous and varied. None of them are exceptionally efficient, and the L1 metric is not a particularly good option for reliable correlations.
回答2:
Looks like you are looking for the nearest neighbor in the movie space. And your distance function is the L1 metric. You can probably use a spatial index of some kind. Maybe you can use techniques from collaborative filtering.
回答3:
CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);
SELECT other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM data me, data other
WHERE me.user = :user
AND other.user <> me.user
AND other.movie = me.movie
GROUP BY
other.user
ORDER BY
distance
Complexity will be O(n1.5)) rather than O(n2), as there will be n
comparisons to sqrt(n)
movies (average of movies filled together by each pair).
来源:https://stackoverflow.com/questions/462563/how-to-store-sets-to-find-similar-patterns-fast