Question
I'm running the apriori algorithm like this:
rules <- apriori(dt)
inspect(rules)
where dt is my data.frame with this format:
> head(dt)
Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1
The idea of the data set is to capture each customer and whether he/she bought three different items (T, C and B) on a particular purchase. For example, based on the information above, customer C1 bought C and B, customers C2 to C5 bought only C, and customer C6 bought C and B.
The output is the following:
lhs rhs support confidence lift
1 {} => {T=0} 0.90 0.9000000 1.0000000
2 {} => {C=1} 0.91 0.9100000 1.0000000
3 {B=0} => {T=0} 0.40 0.8163265 0.9070295
4 {B=0} => {C=1} 0.40 0.8163265 0.8970621
5 {B=1} => {T=0} 0.50 0.9803922 1.0893246
6 {B=1} => {C=1} 0.51 1.0000000 1.0989011
My questions are:
1) How can I get rid of rules where T, C or B are equal to 0? If you think about it, the rule {B=0} => {T=0}, or even {B=1} => {T=0}, doesn't really make sense.
2) I was reading about the apriori algorithm, and in most of the examples each line represents an actual transaction, so in my case it should be something like:
C,B
C
C
C
C
C, B
instead of my sets of ones and zeros. Is that a requirement, or can I still work with my format?
Thanks
Answer 1:
I'm not sure what the aim of the program is supposed to be, but the Apriori algorithm has two aims. The first is to extract frequent itemsets from the given data, where a frequent itemset is a set of items that often appear together in the data. The second is to generate association rules from those extracted frequent itemsets. An association rule looks, for example, like this:
B -> C
In this case it means that customers who bought B also buy C with a certain probability. The support and confidence levels of the Apriori algorithm act as thresholds here: the support level controls how many frequent itemsets are kept, and the confidence level controls how many association rules are kept. Association rules at or above the confidence level are called strong association rules.
Against this backdrop, I don't understand why the Apriori algorithm is used to determine whether a customer bought different articles; that could be answered with an if statement. The provided output also makes no sense in this context. The third line, for example, says that if a customer does not buy B then he does not buy T, with a support of 40% and a confidence of 81.6%. Apart from the fact that association rules do not have a support, only the association rule B -> C is correct, but its confidence value is wrong.
Nevertheless, if the aim is to generate the association rules described above, the original Apriori algorithm cannot handle input in this format:
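For a plain lookup of who bought what, a minimal sketch in base R could look like the following. The data frame is rebuilt here from the six rows shown in the question, and the variable names `bought_b_and_c` and `c1_items` are illustrative only.

```r
# Rebuild the six rows from the question (assumption: these are all the rows we need)
dt <- data.frame(
  Cus = paste0("C", 1:6),
  T = c(0, 0, 0, 0, 0, 0),
  C = c(1, 1, 1, 1, 1, 1),
  B = c(1, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Customers who bought both B and C: a simple logical filter, no Apriori needed
bought_b_and_c <- dt$Cus[dt$B == 1 & dt$C == 1]

# Items bought by a single customer, here C1
items <- c("T", "C", "B")
c1_items <- items[unlist(dt[dt$Cus == "C1", items]) == 1]

print(bought_b_and_c)
print(c1_items)
```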
> head(dt)
Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1
The unmodified Apriori algorithm needs the data set in this format:
> head(dt)
C1: {B, C}
C2: {C}
C3: {C}
C4: {C}
C5: {C}
C6: {B, C}
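One possible way to get from the 0/1 table to this transaction format in R is sketched below. It assumes the arules package (which provides the `apriori` and `inspect` functions used in the question) is installed, and it rebuilds the six rows from the question by hand. The thresholds of 0.3 are only the example values used in this answer, not recommended settings.

```r
library(arules)

# Rebuild the six rows from the question (assumption: illustrative data only)
dt <- data.frame(
  Cus = paste0("C", 1:6),
  T = c(0, 0, 0, 0, 0, 0),
  C = c(1, 1, 1, 1, 1, 1),
  B = c(1, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Drop the customer id and turn the 0/1 columns into a logical incidence matrix:
# TRUE means "this item is in the transaction"
m <- as.matrix(dt[, c("T", "C", "B")]) == 1
rownames(m) <- dt$Cus

# Coerce the incidence matrix to a transactions object; only purchased items
# end up in a transaction, so items such as "B=0" never exist
trans <- as(m, "transactions")
inspect(trans)

# Mine rules on the transactions with the example thresholds
rules <- apriori(trans, parameter = list(support = 0.3, confidence = 0.3))
inspect(rules)
```

Because each transaction now lists only the items that were bought, rules of the form {B=0} => {T=0} can no longer be generated.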
I see two solutions: either reformat the input before calling the algorithm, or customize the Apriori algorithm to accept this format, which would arguably just be a change of the input format inside the algorithm. To clarify why the stated input format is needed, here is the Apriori algorithm in a nutshell with the provided data:
Support level = 0.3
Confidence level = 0.3
Number of customers = 6
Total number of B's bought = 2
Total number of C's bought = 6
Support of B = 2 / 6 ≈ 0.33 >= 0.3 = support level
Support of C = 6 / 6 = 1 >= 0.3 = support level
Support of B, C = 2 / 6 ≈ 0.33 >= 0.3 = support level
-> Frequent itemsets = {B, C, BC}
-> Association rules = {B -> C, C -> B}
Confidence of B -> C = 2 / 2 = 1 >= 0.3 = confidence level
Confidence of C -> B = 2 / 6 ≈ 0.33 >= 0.3 = confidence level
-> Strong association rules = {B -> C, C -> B}
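The arithmetic above can be reproduced with a few lines of base R. This is only a sketch of the support and confidence calculations on the six example rows, not an implementation of Apriori itself; the helper names `support` and `confidence` are made up for this illustration.

```r
# The six example rows; T is left out because nobody bought it
purchases <- data.frame(
  C = c(1, 1, 1, 1, 1, 1),
  B = c(1, 0, 0, 0, 0, 1)
)

# Support of an itemset: share of transactions that contain every item in it
support <- function(data, items) {
  mean(rowSums(data[, items, drop = FALSE] == 1) == length(items))
}

# Confidence of lhs -> rhs: support of (lhs and rhs together) divided by support of lhs
confidence <- function(data, lhs, rhs) {
  support(data, c(lhs, rhs)) / support(data, lhs)
}

supp_B   <- support(purchases, "B")          # 2 / 6
supp_C   <- support(purchases, "C")          # 6 / 6
supp_BC  <- support(purchases, c("B", "C"))  # 2 / 6
conf_B_C <- confidence(purchases, "B", "C")  # (2 / 6) / (2 / 6)

print(c(supp_B = supp_B, supp_C = supp_C, supp_BC = supp_BC, conf_B_C = conf_B_C))
```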
Hope this helps.
Source: https://stackoverflow.com/questions/29608768/r-association-rules-apriori