问题
Assume we have 3 rules:
[1] {A,B,D} -> {C}
[2] {A,B} -> {C}
[3] Whatever it is
Rule [2]
is a subset of rule [1]
(because rule [1]
contains all the items in rule [2]
), so rule [1]
should be eliminated (because rule [1]
is too specific and its information is included in rule [2]
)
I searched through the internet and everyone is using these code to remove redundant rules:
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]
I dont understand how the code work.
After line 2 of the code, the subset.matrix will become:
[,1] [,2] [,3]
[1,] NA 1 0
[2,] NA NA 0
[3,] NA NA NA
The cells in the lower triangle are set to be NA and since rule [2]
is a subset of rule [1]
, the corresponding cell is set to 1. So I have 2 questions:
Why do we have to set the lower triangle as NA? If we do so then how can we check whether rule
[2]
is subset of rule[3]
or not? (the cell has been set as NA)In our case, rule
[1]
should be the one to be eliminated, but these code eliminate rule[2]
instead of rule[1]
. (Because the first cell in column 2 is 1, and according to line 3 of the code, the column sums of column 2 >= 1, therefore will be treated as redundant)
Any help would be appreciated !!
回答1:
For your code to work you need an interest measure (confidence or lift) and rules.sorted
needs to be sorted by either confidence or lift. Anyway, the code is horribly inefficient since is.subset()
creates a matrix of size n^2, where n is the number of rules. Also, is.subset
for rules merges rhs and lhs of the rule which is not correct. So don't worry too much about the implementation details.
A more efficient way to do this is now implemented as function is.redundant()
in package arules (available in version 1.4-2).
This explanation comes from the manual page:
A rule is redundant if a more general rules with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if
for some X' subset X, conf(X' -> Y) >= conf(X -> Y).
This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation other measures than confidence, e.g. improvement of lift, can be used as well.
Check out the examples in ? is.redundant
.
回答2:
Remove redundant rules with arules package...
Run apriori algorithm:
rules <- apriori(transDat, parameter = list(supp = 0.01, conf = 0.5, target = "rules", maxlen = 3))
Remove redundant:
rules <- rules[!is.redundant(rules)]
Inspect:
arules::inspect(rules)
Create a dataframe:
df = data.frame(
lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality)
回答3:
Just check out help for is.redundant() in rstudio, It clearly states that
Suppose there is a
rule1 X->Y with confidence cf1
rule2 X' -> Y with confidence cf2 where X' is a subset of X
rule1 is said to be redundant if rule2 has a higher confidence than rule1 i.e cf2 > cf1 (where X' is a subset of X)
i.e if there is a rule where subset of lhs can give rhs with more confidence then prior rule is said to be redundant rule.
We make lower triangle as na so that the rule doesn't become subset of itself
Insufficient information, rules cant be said redundant just on basis of subsetting, confidence value has to be taken in account
来源:https://stackoverflow.com/questions/38796256/association-rule-in-r-removing-redundant-rule-arules