Frequent itemset SQL

Question

I'm using SAS for a piece of coursework. At the moment, I have a set of Order IDs and Product IDs. I want to find out which products are most frequently ordered together. Think milk and cereal in a grocery basket.

I am not very good at programming, so I would really appreciate it if anyone could spare a bit of time and write a few simple lines of SQL I can easily use. It's not a heavy dataset and there are only two columns (Order_ID and Product_ID).

For example:

Order ID    Product ID
10001       64564564
10001       546456
10001       54646
10003       5464
10003       342346

I've spent three hours researching now and am a bit desperate :(

Answer 1:

If you think about it, you can find the answer by asking the question this way: for every possible pair of products, how many times did the two products occur on the same order? Then order by the count to float the answer(s) to the top:

select 
    p1.product_id, p2.product_id, count(*) times_order_together 
from
    orders p1
    inner join
    orders p2
    on
        p1.order_id = p2.order_id
        and
        p1.product_id != p2.product_id 
group by
    p1.product_id, p2.product_id
order by
    count(*) desc

Products that were never ordered together don't show up at all. Also, each pair is represented twice: a row for eggs with milk and a row for milk with eggs. These duplicate pairs are removable, but the query gets uglier, and simple is good.

To elaborate a bit, p1 and p2 are aliases of orders. Aliasing lets you use the same data source more than once in a query and still distinguish between the two copies. Also, count(*) times_order_together just gives the name 'times_order_together' to the calculation count(*), which counts how many times a product pairing occurs on the same order.
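
To make the point about duplicate pairs concrete, here is a minimal sketch that runs the query against a tiny in-memory SQLite database from Python. The sample rows, the helper name pair_counts, and the use of SQLite are assumptions for illustration (the question uses SAS). Swapping `!=` for `<` is one possible way to keep a single row per pair.

```python
import sqlite3

# Made-up sample rows; table and column names follow the query above.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, product_id text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(1, "milk"), (1, "cereal"), (2, "milk"), (2, "cereal"), (3, "milk"), (3, "eggs")],
)


def pair_counts(comparison):
    """Run the pair-count query with '!=' (both directions) or '<' (one row per pair)."""
    if comparison not in ("!=", "<"):
        raise ValueError("comparison must be '!=' or '<'")
    sql = f"""
        select p1.product_id, p2.product_id, count(*) as times_order_together
        from orders p1
        inner join orders p2
            on p1.order_id = p2.order_id
            and p1.product_id {comparison} p2.product_id
        group by p1.product_id, p2.product_id
        order by count(*) desc, p1.product_id, p2.product_id
    """
    return conn.execute(sql).fetchall()


both_directions = pair_counts("!=")  # eggs/milk and milk/eggs both appear
one_direction = pair_counts("<")     # each pair appears once

print(len(both_directions), len(one_direction))
for row in one_direction:
    print(row)
```

With `!=`, every pair comes back in both directions. With `<`, only the direction where the first product sorts before the second remains, so the number of rows halves while the counts stay the same.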

Answer 2:

How about something like:

create table order_together (order_id int, product_id1 varchar(10), product_id2 varchar(10));

insert into order_together
(order_id, product_id1, product_id2)
select o1.order_id, o1.product_id, o2.product_id
from order_line o1, order_line o2
where o1.order_id = o2.order_id

/* you don't want them equal, and you also don't
want to insert cereal-milk and milk-cereal on the same order */
and o1.product_id < o2.product_id;

Now you have pairs of products together and you can go wild with counts and stats. Mind you, this is quite naive and would blow up in volume quite quickly: an order with n distinct products produces n(n-1)/2 pair rows.

Maybe

select count(distinct o1.order_id), o1.product_id, o2.product_id
... 
group by o1.product_id, o2.product_id

would be better.
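
A sketch of why count(distinct ...) might be the safer count, assuming the elided part of that query is the same self-join on order_line used earlier (that is a guess, not something stated here). The sample rows are made up, and order 1 deliberately lists milk on two lines. SQLite from Python is used only so the example can be run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table order_line (order_id integer, product_id text)")
conn.executemany(
    "insert into order_line values (?, ?)",
    [
        (1, "milk"), (1, "milk"), (1, "cereal"),  # milk appears on two lines of order 1
        (2, "milk"), (2, "cereal"),
    ],
)

rows = conn.execute(
    """
    select o1.product_id, o2.product_id,
           count(*) as joined_rows,
           count(distinct o1.order_id) as orders_together
    from order_line o1, order_line o2
    where o1.order_id = o2.order_id
      and o1.product_id < o2.product_id
    group by o1.product_id, o2.product_id
    """
).fetchall()

print(rows)
```

Here count(*) counts the cereal-milk pairing of order 1 twice because of the repeated milk line, while count(distinct o1.order_id) counts each order once.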

In response to the comment:

But you are grabbing pairs of products ordered together, coming from different rows of the same order's order_lines.

Try this on sqlfiddle.com.

Put this in the left (Build Schema) pane. It creates the tables:

create table order_line(order_no int, product_id varchar(10));

create table order_together(order_no int, product_id1 varchar(10), product_id2 varchar(10));

Put this in the right pane and click Run SQL:

insert into order_line (order_no, product_id) values (1, 'milk');
insert into order_line (order_no, product_id) values (1, 'cereal');
insert into order_line (order_no, product_id) values (1, 'rice');
insert into order_line (order_no, product_id) values (2, 'milk');
insert into order_line (order_no, product_id) values (2, 'cereal');
insert into order_line (order_no, product_id) values (3, 'milk');
insert into order_line (order_no, product_id) values (3, 'cookies');
insert into order_line (order_no, product_id) values (4, 'milk');
insert into order_line (order_no, product_id) values (4, 'cookies');
insert into order_line (order_no, product_id) values (5, 'rice');
insert into order_line (order_no, product_id) values (5, 'icecream');

select o1.order_no, o1.product_id as product_from_row1, o2.product_id as product_from_row2
from order_line o1, order_line o2
where o1.order_no = o2.order_no
and o1.product_id < o2.product_id

gives:

order_no    product_from_row1    product_from_row2
1           milk                 rice
1           cereal               milk
1           cereal               rice
2           cereal               milk
3           cookies              milk
4           cookies              milk
5           icecream             rice

Give it a try, then think about what the query is requesting, which is joining different order_lines of the same order. That's pretty much the definition of "ordered together".
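
Rolling that pair listing up into counts is what answers the original question. The sketch below loads the same eleven order_line rows into an in-memory SQLite database from Python (both are stand-ins so the example can run anywhere, not what the fiddle used) and groups the pairs. The aliases product_id1, product_id2 and times_together are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table order_line (order_no int, product_id varchar(10))")
conn.executemany(
    "insert into order_line (order_no, product_id) values (?, ?)",
    [
        (1, "milk"), (1, "cereal"), (1, "rice"),
        (2, "milk"), (2, "cereal"),
        (3, "milk"), (3, "cookies"),
        (4, "milk"), (4, "cookies"),
        (5, "rice"), (5, "icecream"),
    ],
)

# Same self-join as above, grouped by pair and sorted so the most frequent pairs come first.
ranked = conn.execute(
    """
    select o1.product_id as product_id1,
           o2.product_id as product_id2,
           count(*) as times_together
    from order_line o1, order_line o2
    where o1.order_no = o2.order_no
      and o1.product_id < o2.product_id
    group by o1.product_id, o2.product_id
    order by times_together desc, product_id1, product_id2
    """
).fetchall()

for product_id1, product_id2, times_together in ranked:
    print(product_id1, product_id2, times_together)
```

On this data, cereal with milk and cookies with milk tie at two orders each, and the remaining three pairs occur once.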

Source: https://stackoverflow.com/questions/29722055/frequent-itemset-sql

Tags

sql

frequency