
Implementing a simple ranking of rules

We wish to find rules of the type If a person buys product X, then they are likely to purchase product Y. We can quite easily create a list of all the rules in our dataset by simply finding all occasions when two products are purchased together. However, we then need a way to determine good rules from bad ones allowing us to choose specific products to recommend.

We can evaluate rules of this type in many ways; here we will focus on two measures: support and confidence.

Support is the number of times that a rule occurs in a dataset, which is computed by simply counting the number of samples for which the rule is valid. It can sometimes be normalized by dividing by the total number of times the premise of the rule is valid, but we will simply count the raw total for this implementation.

The premise is the requirement for a rule to be considered active. The conclusion is the output of the rule. For the example rule if a person buys an apple, they also buy a banana, the rule only applies if the premise happens: a person buys an apple. The rule's conclusion then states that the person will buy a banana.

While support measures how often a rule exists, confidence measures how accurate the rule is when it can be used. It is computed as the percentage of times the rule applies when the premise applies: we first count how many times the rule holds in our data and divide that by the number of samples where the premise (the if statement) occurs.
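Expressed as a formula, with support counted as a raw total as described above, the confidence of a rule is:

confidence(premise -> conclusion) = support(premise -> conclusion) / number of samples in which the premise occurs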

As an example, we will compute the support and confidence for the rule if a person buys apples, they also buy bananas.

As the following example shows, we can tell whether someone bought apples in a transaction by checking the value of sample[3], where sample is assigned a row of our matrix:

sample = X[2]

Similarly, we can check if bananas were bought in a transaction by seeing if the value of sample[4] is equal to 1 (and so on). We can now compute the number of times our rule exists in our dataset and, from that, the confidence and support.
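As a minimal sketch of that computation (assuming X is the feature matrix loaded earlier, with apples at index 3 and bananas at index 4), we could count the premise occurrences and the valid cases for this one rule directly:

num_apple_purchases = 0
rule_valid = 0
for sample in X:
    if sample[3] == 1:  # The premise holds: this person bought apples
        num_apple_purchases += 1
        if sample[4] == 1:  # The conclusion also holds: they bought bananas
            rule_valid += 1
rule_support = rule_valid
rule_confidence = rule_valid / num_apple_purchases if num_apple_purchases else 0.0
print("Support: {0}".format(rule_support))
print("Confidence: {0:.3f}".format(rule_confidence))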

Now we need to compute these statistics for all rules in our database. We will do this by creating a dictionary for both valid rules and invalid rules. The key of each dictionary will be a tuple of (premise, conclusion). We will store the indices, rather than the actual feature names. Therefore, we would store (3, 4) to signify the previous rule If a person buys apples, they will also buy bananas. If both the premise and the conclusion hold for a sample, the rule is considered valid; if the premise holds but the conclusion does not, the rule is considered invalid for that sample.

The following steps will help us to compute the confidence and support for all possible rules:

  1. We first set up some dictionaries to store the results. We will use defaultdict for this, which sets a default value if a key is accessed that doesn't yet exist. We record the number of valid rules, invalid rules, and occurrences of each premise:
from collections import defaultdict 
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
  2. Next, we compute these values in a large loop. We iterate over each sample in the dataset and then loop over each feature as a premise. We then loop over each feature again as a possible conclusion, mapping the relationship premise to conclusion. If the sample shows that a person bought both the premise and the conclusion products, we record this in valid_rules. If they did not purchase the conclusion product, we record this in invalid_rules:
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in this transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1

If the premise is valid for this sample (it has a value of 1), then we record this and check each conclusion of our rule. We skip over any conclusion that is the same as the premise; this would give us rules such as if a person buys apples, then they buy apples, which obviously doesn't help us much.

We have now computed the necessary statistics and can determine the support and confidence for each rule. As before, the support is simply our valid_rules value:

support = valid_rules

We can compute the confidence in the same way, but we must loop over each rule to compute this:

confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]

We now have a dictionary with the support and confidence for each rule. We can create a function that will print out the rules in a readable format. The function's signature takes the premise and conclusion indices, the support and confidence dictionaries we just computed, and the features array that tells us what the features mean. It then prints out the support and confidence of that rule:

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

We can test the code by calling it in the following way (feel free to experiment with different premises and conclusions):

# Apples are at index 3 and bananas at index 4 in our features array
premise = 3
conclusion = 4
print_rule(premise, conclusion, support, confidence, features)