I am trying to code an algorithm that acts a bit like Amazon's "People who bought this also bought".
The difference between the two is that mine is just counting how often products are viewed together in the same session, rather than purchased together.
I was able to achieve your desired result using a simple DB structure, and a pretty simple query:
TABLE `exa`
| sesh_id | prod_id |
---------------------
|    1    |    1    |
|    1    |    2    |
|    1    |    3    |
|    1    |    4    |
|    2    |    2    |
|    2    |    3    |
|    2    |    4    |
|    3    |    3    |
|    3    |    4    |
|    4    |    1    |
|    4    |    2    |
|    4    |    5    |
SELECT c.prod_id, COUNT(*)
FROM `exa` a
JOIN `exa` b ON a.prod_id=b.prod_id
JOIN `exa` c ON b.sesh_id=c.sesh_id
WHERE a.`prod_id`=3 AND c.prod_id!=3
GROUP BY c.prod_id
ORDER BY 2 DESC;
| prod_id | COUNT |
-------------------
|    4    |   9   |
|    2    |   6   |
|    1    |   3   |
The idea is that every time a session views a product, it gets inserted into the table (in this case `exa`).
Then, on any particular product view, you can check and see what other products people who viewed this product also viewed, weighted by frequency. So in this particular example, EVERYONE that viewed product #3 viewed product #4, so it comes up first in the sort. Product #5 was only viewed by session #4, and session #4 didn't view product #3, so product #5 doesn't come up in the results.
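A minimal sketch of how that could be wired up, assuming SQLite and the made-up helper names record_view / also_viewed (the recommendation query is the same one shown above):

import sqlite3

conn = sqlite3.connect("views.db")
conn.execute("CREATE TABLE IF NOT EXISTS exa (sesh_id INTEGER, prod_id INTEGER)")

def record_view(sesh_id, prod_id):
    # Called on every product view: one row per (session, product).
    conn.execute("INSERT INTO exa (sesh_id, prod_id) VALUES (?, ?)", (sesh_id, prod_id))
    conn.commit()

def also_viewed(prod_id, limit=10):
    # Other products seen in sessions that also saw prod_id,
    # ordered by how often they co-occur (the query from above).
    rows = conn.execute("""
        SELECT c.prod_id, COUNT(*) AS freq
        FROM exa a
        JOIN exa b ON a.prod_id = b.prod_id
        JOIN exa c ON b.sesh_id = c.sesh_id
        WHERE a.prod_id = ? AND c.prod_id != ?
        GROUP BY c.prod_id
        ORDER BY freq DESC
        LIMIT ?
    """, (prod_id, prod_id, limit))
    return rows.fetchall()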
You are not going to need any (any!) +1s.
What you do need is a history. In the case of "customers who bought X also bought Y" it's a history of purchases. In the case of "customers who saw X were also interested in Y" it's a history of who saw what.
As soon as you have a history, you're ready to make sense of your data. Here, you'll need to frame it as a nearest-neighbour problem: what you need are the nearest neighbours of X. The coordinates are users, and the values are 0 or 1 depending on whether a user saw/bought an item.
The easiest way to calculate the distance is just the sum of squared deltas; it can be recalculated once an hour, for example (once you have plenty of views, the distances will stop changing very often), and after that you'll always have a table of distances handy.
Examples of this approach can be found (in Python) in Programming Collective Intelligence, published by O'Reilly.
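A minimal sketch of that distance calculation in pure Python, assuming the history is simply a dict mapping each item to the set of users who saw it (all names and data here are made up for illustration):

def squared_distance(item_x, item_y, history, users):
    # Each user is a coordinate; the value is 1 if the user saw the item, else 0.
    # The distance is the sum of squared deltas over all users.
    return sum((int(u in history[item_x]) - int(u in history[item_y])) ** 2 for u in users)

def nearest_neighbours(item_x, history, k=5):
    users = set().union(*history.values())
    others = [i for i in history if i != item_x]
    # Smallest distance = most similar viewing pattern.
    return sorted(others, key=lambda y: squared_distance(item_x, y, history, users))[:k]

history = {
    "X": {"alice", "bob", "carol"},
    "Y": {"alice", "bob"},
    "Z": {"dave"},
}
print(nearest_neighbours("X", history))  # ['Y', 'Z'] -- "Y" is closer to "X" than "Z"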
I would make one improvement to your idea. When you're figuring out which cliques go together and deciding which ones make the strongest relationships, you should add a weight to each connection. The easiest way to calculate the weight would be to count how many people who looked at product X also looked at product Y. The more views, the stronger the relationship.
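For example, the weight of a single X-Y connection could be computed as the number of sessions that viewed both products (a sketch with made-up data; sessions_by_product is an assumed lookup):

# Maps each product to the set of sessions that viewed it (hypothetical data).
sessions_by_product = {
    "X": {1, 2, 3},
    "Y": {2, 3, 4},
}

# Weight of the X-Y connection: how many sessions viewed both products.
weight = len(sessions_by_product["X"] & sessions_by_product["Y"])
print(weight)  # 2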
The Netflix Prize was won with a solution based on SVD. Implementing a covariance matrix in a database table is a challenge. Implementing SVD in a database might be a research problem. But most people would call it insane.
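For what it's worth, here is a tiny sketch of the SVD idea outside the database, using NumPy on a toy view matrix; it only illustrates item similarity in a latent space and is in no way the Netflix solution:

import numpy as np

# Rows = users, columns = products; 1 means the user viewed the product.
views = np.array([
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
], dtype=float)

# Truncated SVD: keep the top-k latent factors per product.
U, s, Vt = np.linalg.svd(views, full_matrices=False)
k = 2
item_factors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per product

# Cosine similarity between product #3 (column index 2) and every product.
norms = np.linalg.norm(item_factors, axis=1)
sims = item_factors @ item_factors[2] / (norms * norms[2] + 1e-9)
print(np.argsort(-sims))  # column indices ranked by similarity to product #3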
OK, I think I figured it out. Part of the work is done in code.
What I did was group by sessionID and productID.
Then, in my code, I iterate through each sessionID and build a dictionary of pairs.
For instance, say I have product IDs 10, 20, and 30, which basically form a clique.
I insert them into the dictionary as follows:
1. 10-20, weight 1
2. 20-10, weight 1
3. 10-30, weight 1
4. 30-10, weight 1
5. 20-30, weight 1.
6. 30-20, weight 1.
If I encounter one of these pairs again, I add +1 to the appropriate pair(s).
At the end, I have all the pairs with their weights.
All I have to do now is take a given productID, scan the table, and find the clique it belongs to.
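Here is a minimal sketch of that pair counting in Python, assuming the grouped (sessionID, productID) rows are already loaded into a list (all names and data are illustrative):

from collections import defaultdict
from itertools import combinations

# (sessionID, productID) rows, e.g. as returned by the GROUP BY query.
rows = [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20)]

# Group product IDs per session.
products_by_session = defaultdict(set)
for sesh_id, prod_id in rows:
    products_by_session[sesh_id].add(prod_id)

# Build the weighted pair dictionary: both directions, +1 per co-occurrence.
weights = defaultdict(int)
for prods in products_by_session.values():
    for a, b in combinations(sorted(prods), 2):
        weights[(a, b)] += 1
        weights[(b, a)] += 1

def related_products(prod_id):
    # Scan the pair dictionary for the given product, strongest links first.
    pairs = [(other, w) for (p, other), w in weights.items() if p == prod_id]
    return sorted(pairs, key=lambda t: t[1], reverse=True)

print(related_products(10))  # [(20, 2), (30, 1)]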
If you have suggestions for improvements, please let me know!
Thanks, all!