Learning Data Mining with Python (Second Edition)

Ranking to find the best rules

Now that we can compute the support and confidence of all rules, we want to be able to find the best rules. To do this, we perform a ranking and print the ones with the highest values. We can do this for both the support and confidence values.

To find the rules with the highest support, we first sort the support dictionary. A dictionary cannot be sorted directly, but its items() method gives us the (key, value) pairs it contains, and sorted() turns those pairs into an ordered list. We use itemgetter, from the operator module, as the sort key. Using itemgetter(1) picks the second element of each pair, so the list is ordered by value rather than by key. Setting reverse=True gives us the highest values first:

from operator import itemgetter 
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
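To see what this sort produces, here is a minimal sketch on a small dictionary. The name toy_support and the counts in it are invented for illustration; they are not taken from the dataset used in this chapter.

```python
from operator import itemgetter

# Hypothetical toy data: keys are (premise, conclusion) pairs, values are counts.
toy_support = {(0, 1): 14, (1, 0): 14, (2, 3): 27, (3, 4): 22}

# items() yields (key, value) pairs; itemgetter(1) extracts the value from
# each pair, so the pairs are ordered by count, largest first.
sorted_toy = sorted(toy_support.items(), key=itemgetter(1), reverse=True)

# Each entry of the sorted list is still a ((premise, conclusion), count) pair.
top_rule, top_count = sorted_toy[0]
print(top_rule, top_count)
```

Because each entry keeps the rule alongside its score, indexing an entry with [0] recovers the (premise, conclusion) pair, which is what the printing loop relies on.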

We can then print out the top five rules by support:

for index in range(5):
    print("Rule #{0}".format(index + 1))
    # Each entry is ((premise, conclusion), support); [0] selects the rule
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)

The result will look like the following:

Rule #1 
Rule: If a person buys bananas they will also buy milk 
 - Support: 27 
 - Confidence: 0.474 
Rule #2 
Rule: If a person buys milk they will also buy bananas 
 - Support: 27 
 - Confidence: 0.519 
Rule #3 
Rule: If a person buys bananas they will also buy apples 
 - Support: 27 
 - Confidence: 0.474 
Rule #4 
Rule: If a person buys apples they will also buy bananas 
 - Support: 27 
 - Confidence: 0.628 
Rule #5 
Rule: If a person buys apples they will also buy cheese 
 - Support: 22 
 - Confidence: 0.512

Similarly, we can print the top rules based on confidence. First, we compute the sorted confidence list, and then we print the rules using the same method as before:

sorted_confidence = sorted(confidence.items(), key=itemgetter(1),
                           reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)

Two rules are near the top of both lists. The first is "If a person buys apples, they will also buy cheese", and the second is "If a person buys cheese, they will also buy bananas". A store manager can use rules like these to organize their store. For example, if apples are on sale this week, put a display of cheeses nearby. Similarly, it would make little sense to put bananas on sale at the same time as cheese: nearly 66 percent of people buying cheese will probably buy bananas anyway, so the sale won't increase banana purchases all that much.
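Spotting rules that rank highly on both measures can also be done in code rather than by eye. The following is a rough sketch, not the book's method: the helper top_rules, the dictionaries toy_support and toy_confidence, and all the numbers in them are hypothetical and chosen only to illustrate the idea.

```python
from operator import itemgetter


def top_rules(scores, n):
    """Return the n highest-scoring rules (the keys of scores), best first."""
    ranked = sorted(scores.items(), key=itemgetter(1), reverse=True)
    return [rule for rule, _ in ranked[:n]]


# Hypothetical scores keyed by (premise, conclusion); the values are made up.
toy_support = {
    ("cheese", "bananas"): 27,
    ("bread", "milk"): 25,
    ("apples", "cheese"): 22,
    ("milk", "bread"): 9,
}
toy_confidence = {
    ("cheese", "bananas"): 0.66,
    ("bread", "milk"): 0.27,
    ("apples", "cheese"): 0.61,
    ("milk", "bread"): 0.30,
}

top_by_support = top_rules(toy_support, 3)
top_by_confidence = set(top_rules(toy_confidence, 3))

# Keep the support ordering, but only for rules that also rank highly
# on confidence.
in_both = [rule for rule in top_by_support if rule in top_by_confidence]
print(in_both)
```

In this made-up data, the bread and milk rule has high support but low confidence, so it is filtered out, which is the kind of rule a manager would not want to act on.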

Jupyter Notebook can display graphs inline, right in the notebook. However, this is not always configured by default. To configure Jupyter Notebook to display graphs inline, run the following line in a notebook cell:

%matplotlib inline

We can visualize the results using a library called matplotlib.

We are going to start with a simple line plot showing the confidence values of the rules, in order of confidence. matplotlib makes this easy: we just pass in the numbers, and it draws a simple but effective plot:

from matplotlib import pyplot as plt 
plt.plot([confidence[rule[0]] for rule in sorted_confidence])

In the resulting graph, we can see that the first five rules have decent confidence, but the efficacy drops quite quickly after that. Using this information, we might decide to use just the first five rules to drive business decisions. Ultimately, with exploration techniques like this, deciding where to draw that line is up to the user.
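One way to make such a cutoff explicit is to keep only the rules whose confidence is above a chosen threshold. This is a minimal sketch under stated assumptions: the threshold min_confidence of 0.5 is an arbitrary choice, and toy_confidence holds invented values rather than the results computed in this chapter.

```python
from operator import itemgetter

# Hypothetical confidence values keyed by (premise, conclusion).
toy_confidence = {
    ("a", "b"): 0.63,
    ("b", "a"): 0.52,
    ("c", "d"): 0.47,
    ("d", "c"): 0.21,
}

sorted_toy = sorted(toy_confidence.items(), key=itemgetter(1), reverse=True)

# Assumed cutoff; in practice it would be picked by looking at where the
# plotted confidence values start to fall away.
min_confidence = 0.5

kept = [(rule, value) for rule, value in sorted_toy if value >= min_confidence]
for rule, value in kept:
    print(rule, value)
```

The threshold is a judgment call: a lower value keeps more rules at the cost of including weaker ones.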

Data mining has great exploratory power in examples like this. A person can use data mining techniques to explore relationships within their datasets to find new insights. In the next section, we will use data mining for a different purpose: prediction and classification.