Question
I'm using the Apyori library as an implementation of the Apriori algorithm:
rules = apriori(trs, min_support = 0.02, min_confidence = 0.1, min_lift = 3)
rules is a generator and can be converted to a list with res = list(rules). For a large dataset, list(rules) seems to take a long time.
Can you help me understand whether the rules are sorted by some criterion, so that I can retrieve only the top-n most relevant rules? Or, what is the most efficient way to sort the rules, by lift for example?
This is what the typical output looks like (i.e. one element of the list):
RelationRecord(items=frozenset({'chicken', 'light cream'}),
               support=0.004532728969470737,
               ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
                                                    items_add=frozenset({'chicken'}),
                                                    confidence=0.29059829059829057,
                                                    lift=4.84395061728395)])
Answer 1:
Can you help me understand whether the rules are sorted by some criterion?
tl;dr: They're in ascending order by length and, secondarily, by the order in which the items in the consequent first appear in your transactions.
Long explanation: Apriori is a breadth-first (level-wise) algorithm by default. During the mining step, it first discovers all frequent itemsets of length 1, then all frequent itemsets of length 2, then 3, and so on. That means what ultimately determines the order is the order of the single-item candidates. With Apyori, items are added to a Python list as they're first encountered in the transactions (see the add_transaction() method of the TransactionManager class here).
Rule generation works similarly with regard to consequents that meet the minimum confidence/lift thresholds. For example, for the frequent itemset {a, b, c, d}, we will look at rules (i.e. associations that met our interestingness criteria) that have just one item in the consequent first (e.g. {a, c, d} -> {b}, then {a, b, d} -> {c}), followed by interesting rules with two items in the consequent (e.g. {a, d} -> {b, c}).
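The enumeration order described above can be sketched with itertools.combinations. This is an illustrative reconstruction of the level-wise ordering, not Apyori's actual code: for each frequent itemset, consequents of size 1 come first, then size 2, and so on, with ties broken by the order items first appeared.

```python
from itertools import combinations

itemset = ("a", "b", "c", "d")  # a frequent itemset; tuple order = first appearance

# Enumerate candidate rules the way a level-wise generator would:
# all consequents of size 1 first, then size 2, and so on.
for k in range(1, len(itemset)):
    for consequent in combinations(itemset, k):
        antecedent = tuple(i for i in itemset if i not in consequent)
        print(antecedent, "->", consequent)
```

The first rule printed is ('b', 'c', 'd') -> ('a',), and two-item consequents such as ('c', 'd') -> ('a', 'b') only appear after every one-item consequent has been emitted.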
What is the most efficient way to sort the rules by the lift for example?
Unfortunately, the upshot of the above explanation is that there really isn't a great way to do this by default. That said, there are a number of modified versions of Apriori and other association-rule-learning algorithms that can help with this. To my knowledge, however, none of those have made it into open-source Python projects. It sounds like a top-k methodology is what you might be looking for. One approach can be found in this paper. If that's not enough, or is too much effort for your project, you might want to consider other approaches.
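That said, if "top-n by lift" is all you need, you don't have to fully sort the materialized list: heapq.nlargest can consume the generator in one pass and keep only the n best records. The RelationRecord / OrderedStatistic namedtuples below are stand-ins mimicking the output shape shown in the question (an assumption made so the sketch is self-contained); with Apyori you would pass the generator returned by apriori(...) directly.

```python
import heapq
from collections import namedtuple

# Stand-ins with the same field names as the records shown in the question.
RelationRecord = namedtuple("RelationRecord",
                            ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic",
                              ["items_base", "items_add", "confidence", "lift"])

def top_n_by_lift(rules, n):
    """Keep the n records with the highest lift while consuming the
    generator once: O(R log n) instead of the O(R log R) of a full sort.
    A record may carry several ordered statistics, so rank by its max lift."""
    return heapq.nlargest(
        n, rules, key=lambda r: max(os.lift for os in r.ordered_statistics))

# Toy records in place of the apriori(...) generator:
rules = iter([
    RelationRecord(frozenset({"a", "b"}), 0.01,
                   [OrderedStatistic(frozenset({"a"}), frozenset({"b"}), 0.3, 4.8)]),
    RelationRecord(frozenset({"c", "d"}), 0.02,
                   [OrderedStatistic(frozenset({"c"}), frozenset({"d"}), 0.5, 3.1)]),
    RelationRecord(frozenset({"e", "f"}), 0.03,
                   [OrderedStatistic(frozenset({"e"}), frozenset({"f"}), 0.4, 6.2)]),
])

top = top_n_by_lift(rules, 2)
print([r.ordered_statistics[0].lift for r in top])  # [6.2, 4.8]
```

Note this does not avoid the cost of generating every rule (the generator is still exhausted); it only avoids building and sorting the full list in memory.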
If you don't need to exhaustively mine all lengths of associations, I'd suggest looking at collaborative filtering.
Source: https://stackoverflow.com/questions/50479752/apyori-relevance-measure