Question
I'm using SAS for a piece of coursework. At the moment, I have a set of Order IDs and Product IDs. I want to find out which products are most frequently ordered together. Think milk and cereal in a grocery basket.
I am not very good at programming, so I would really appreciate it if anyone could spare a bit of time and write a few simple lines of SQL I can easily use. It's not a heavy dataset and there are only two columns (Order_ID and Product_ID).
For example:
Order ID   Product ID
10001      64564564
10001      546456
10001      54646
10003      5464
10003      342346
I've spent three hours researching now and am a bit desperate :(
Answer 1:
If you think about it, you can find the answer by asking the question this way: for every possible pair of products, how many times did the two products occur on the same order? Then order by the count to float the answer(s) to the top:
select
    p1.product_id, p2.product_id, count(*) as times_order_together
from
    orders p1
inner join
    orders p2
on
    p1.order_id = p2.order_id
and
    /* <> is the standard SQL "not equal" operator */
    p1.product_id <> p2.product_id
group by
    p1.product_id, p2.product_id
order by
    count(*) desc
Products that were never ordered together don't show up at all. Also, each pair is represented twice: a row for eggs with milk and a row for milk with eggs. These duplicate pairs can be removed, but the query gets uglier, and simple is good.
To elaborate a bit, p1 and p2 are aliases of orders. You do that to be able to use a data source more than once and yet distinguish between the two uses. Also, the times_order_together that follows count(*) just gives the name 'times_order_together' to the calculation count(*). It's counting the number of times a product pairing occurs in an order.
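As a rough sketch of how the duplicate pairs could be removed: comparing the two product IDs with `<` instead of "not equal" keeps only one ordering of each pair. The snippet below uses Python's built-in sqlite3 module purely as a convenient way to run the SQL (an assumption, since the question is about SAS); the orders table name follows the answer, and the rows are the sample data from the question.

```python
import sqlite3

# In-memory database; table and column names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id int, product_id int)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(10001, 64564564), (10001, 546456), (10001, 54646),
     (10003, 5464), (10003, 342346)],
)

# `<` (rather than "not equal") keeps a single row per unordered pair.
pairs = conn.execute(
    """
    select p1.product_id, p2.product_id, count(*) as times_order_together
    from orders p1
    inner join orders p2
        on p1.order_id = p2.order_id
       and p1.product_id < p2.product_id
    group by p1.product_id, p2.product_id
    order by count(*) desc, p1.product_id, p2.product_id
    """
).fetchall()

for row in pairs:
    print(row)
```

With the question's five rows, every pair occurs only once, so all counts are 1; on real data the most frequent pairs would come first.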
Answer 2:
How about something like:
/* column types are an assumption; they match the sqlfiddle example further down */
create table order_together (order_id int, product_id1 varchar(10), product_id2 varchar(10));
insert into order_together
    (order_id, product_id1, product_id2)
select o1.order_id, o1.product_id, o2.product_id
from order_line o1, order_line o2
where o1.order_id = o2.order_id
/* you don't want them equal, and you also don't
   want to insert cereal-milk and milk-cereal for the same order */
and o1.product_id < o2.product_id;
Now you have pairs of products together and you can go wild with counts and stats. Mind you, this is quite naive and would blow up in volume quite quickly, since an order with n distinct products produces n(n-1)/2 rows.
Maybe
select count(distinct o1.order_id), o1.product_id, o2.product_id
...
group by o1.product_id, o2.product_id
would be better.
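One possible reading of why count(distinct ...) over the order ID may be the safer aggregate: if a product appears on more than one line of the same order, count(*) counts that pair more than once for that order. Here is a minimal sketch, again using Python's built-in sqlite3 module only to make the SQL runnable. The order_line table is the one named in the answer, the sample rows are made up for illustration, and the order ID is qualified as o1.order_id because an unqualified column name is ambiguous in a self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table order_line (order_id int, product_id varchar(10))")
conn.executemany(
    "insert into order_line values (?, ?)",
    [(1, "milk"), (1, "milk"), (1, "cereal"),  # milk is on two lines of order 1
     (2, "milk"), (2, "cereal")],
)

# Compare the raw row count with the number of distinct orders per pair.
rows = conn.execute(
    """
    select o1.product_id, o2.product_id,
           count(*) as line_pairs,
           count(distinct o1.order_id) as orders_together
    from order_line o1, order_line o2
    where o1.order_id = o2.order_id
      and o1.product_id < o2.product_id
    group by o1.product_id, o2.product_id
    """
).fetchall()

print(rows)  # [('cereal', 'milk', 3, 2)]
```

The pair was bought together in two orders, but count(*) reports three because of the repeated milk line.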
In response to the comment:
But you are grabbing pairs of products that were ordered together, coming from different rows of the same order's order_lines.
Try this on sqlfiddle.com.
Put this in the left (Build Schema) pane. It creates the tables.
create table order_line(order_no int, product_id varchar(10));
create table order_together(order_no int, product_id1 varchar(10), product_id2 varchar(10));
Put this in the right pane and click Run SQL:
insert into order_line (order_no, product_id) values (1, 'milk');
insert into order_line (order_no, product_id) values (1, 'cereal');
insert into order_line (order_no, product_id) values (1, 'rice');
insert into order_line (order_no, product_id) values (2, 'milk');
insert into order_line (order_no, product_id) values (2, 'cereal');
insert into order_line (order_no, product_id) values (3, 'milk');
insert into order_line (order_no, product_id) values (3, 'cookies');
insert into order_line (order_no, product_id) values (4, 'milk');
insert into order_line (order_no, product_id) values (4, 'cookies');
insert into order_line (order_no, product_id) values (5, 'rice');
insert into order_line (order_no, product_id) values (5, 'icecream');
select o1.order_no, o1.product_id as product_from_row1, o2.product_id as product_from_row2
from order_line o1, order_line o2
where o1.order_no = o2.order_no
and o1.product_id < o2.product_id
This gives:
order_no   product_from_row1   product_from_row2
1          milk                rice
1          cereal              milk
1          cereal              rice
2          cereal              milk
3          cookies             milk
4          cookies             milk
5          icecream            rice
Give it a try, then think about what the query is requesting: it joins different order_lines of the same order. That's pretty much the definition of "ordered together".
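To get from the pair listing to the "most frequently ordered together" ranking the question asks for, the same join can be grouped and counted. The following is only a sketch under assumptions: it runs the SQL through Python's built-in sqlite3 module instead of sqlfiddle, reuses the order_line sample rows from this answer, and the output column names product_1, product_2 and orders_together are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table order_line (order_no int, product_id varchar(10))")
conn.executemany(
    "insert into order_line values (?, ?)",
    [(1, "milk"), (1, "cereal"), (1, "rice"),
     (2, "milk"), (2, "cereal"),
     (3, "milk"), (3, "cookies"),
     (4, "milk"), (4, "cookies"),
     (5, "rice"), (5, "icecream")],
)

# Same self-join as above, but grouped per pair and ranked by order count.
ranked = conn.execute(
    """
    select o1.product_id as product_1,
           o2.product_id as product_2,
           count(distinct o1.order_no) as orders_together
    from order_line o1, order_line o2
    where o1.order_no = o2.order_no
      and o1.product_id < o2.product_id
    group by o1.product_id, o2.product_id
    order by orders_together desc, product_1, product_2
    """
).fetchall()

for row in ranked:
    print(row)
```

On this sample, cereal/milk and cookies/milk come out on top with two orders each, and the remaining three pairs appear once.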
Source: https://stackoverflow.com/questions/29722055/frequent-itemset-sql