I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings of this algorithm? Also, a basic example would make it a lot easier for me to understand.
Well, I would assume you've read the wikipedia entry but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that so I'll assume you haven't read it and suggest that you do.
Read the wikipedia article.
Apriori Algorithm
It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.
Apriori Pruning Principle - If any itemset is infrequent, then its superset should not be generated/tested.
Apriori Property - A given (k+1)-itemset
is a candidate (k+1)-itemset
only if everyone of its k-itemset
subsets are frequent.
Now, here is the apriori algorithm in 4 steps.
- Initially, scan the database/dataset once to get the frequent
1-itemset
. - Generate length
k+1
candidate itemsets from lengthk
frequent itemsets. - Test the candidates against the database/dataset.
- Terminate when no frequent or candidate set can be generated.
Solved Example
Suppose there is a transaction database as follows with 4 transactions including their transaction IDs and items bought with them. Assume the minimum support - min_sup
is 2
. The term support is the number of transactions in which a certain itemset is present/included.
Transaction DB
tid | items
-------------
10 | A,C,D
20 | B,C,E
30 | A,B,C,E
40 | B,E
Now, let's create the candidate 1-itemsets
by the 1st scan of the DB. It is simply called as the set of C_1
as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3
If we test this with min_sup
, we can see {D}
does not satisfy the min_sup
of 2
. So, it will not be included in the frequent 1-itemset
, which we simply call as the set of L_1
as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{E} | 3
Now, let's scan the DB for the 2nd time, and generate candidate 2-itemsets
, which we simply call as the set of C_2
as follows.
itemset | sup
-------------
{A,B} | 1
{A,C} | 2
{A,E} | 1
{B,C} | 2
{B,E} | 3
{C,E} | 2
As you can see, {A,B}
and {A,E}
itemsets do not satisfy the min_sup
of 2
and hence they will not be included in the frequent 2-itemset
, L_2
itemset | sup
-------------
{A,C} | 2
{B,C} | 2
{B,E} | 3
{C,E} | 2
Now let's do a 3rd scan of the DB and get candidate 3-itemsets
, C_3
as follows.
itemset | sup
-------------
{A,B,C} | 1
{A,B,E} | 1
{A,C,E} | 1
{B,C,E} | 2
You can see that, {A,B,C}
, {A,B,E}
and {A,C,E}
does not satisfy min_sup
of 2
. So they will not be included in frequent 3-itemset
, L_3
as follows.
itemset | sup
-------------
{B,C,E} | 2
Now, finally, we can calculate the support (supp)
, confidence (conf)
and lift (interestingness value)
values of the Association/Correlation Rules that can be generated by the itemset {B,C,E}
as follows.
Rule | supp | conf | lift
-------------------------------------------
B -> C & E | 50% | 66.67% | 1.33
E -> B & C | 50% | 66.67% | 1.33
C -> E & B | 50% | 66.67% | 1.77
B & C -> E | 50% | 100% | 1.33
E & B -> C | 50% | 66.67% | 1.77
C & E -> B | 50% | 100% | 1.33
See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.
The best introduction to Apriori can be downloaded from this book:
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
you can download the chapter 6 for free which explain Apriori very clearly.
Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
来源:https://stackoverflow.com/questions/1248373/apriori-algorithm