How to store sets, to find similar patterns fast?

问题

(This is no homework and no work issue. It's just my personal interest/occupation and completly fictional. But I am interested in a good algorithm or data structure.)

Let's assume, that I would run a dating site. And my special feature would be that the singles were matched by movie taste. (Why not?)

In that case I would need a way to store the movie ratings for each user. (So far no problem.) And I would need a data structure to find the best fitting user. The distance between two taste patterns would be the average distance between all ratings that both users made.

Example

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

Distance(X,Z) = avg( abs(9-9) + abs(1-4) ) = 1.5

Distance(Y,Z) = avg( abs(4-6) + abs(6-4) + abs(8-7) ) = 1.666

So Mr. X fits slightly better to Mrs. Z, than Mr. Y does.

I like soulution that ...

... don't need many operations on the database
... don't need to handle a lot of data
... run fast
... deliver the best matching
Ok, maybe I would consider good approximations too.

Try to keep in mind that this should also work with thousands of possible movies, users that rate only about 20-50 movies, and thousands of users.

(Because this is a mental puzzle and not a real problem, work-arrounds are not really helping.)

What would be your search algorithm or data structure?

回答1:

Sounds a lot like the Netflix Prize challenge, more specifically the first half of the most popular approach. The possible implementations of what you are trying to do are numerous and varied. None of them are exceptionally efficient, and the L1 metric is not a particularly good option for reliable correlations.

回答2:

Looks like you are looking for the nearest neighbor in the movie space. And your distance function is the L1 metric. You can probably use a spatial index of some kind. Maybe you can use techniques from collaborative filtering.

回答3:

CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

SELECT  other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance

Complexity will be O(n^1.5)) rather than O(n²), as there will be n comparisons to sqrt(n) movies (average of movies filled together by each pair).

来源：https://stackoverflow.com/questions/462563/how-to-store-sets-to-find-similar-patterns-fast

标签

algorithm

data-structures

pattern-matching

puzzle