Implementing the Apriori algorithm
On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. On the second iteration (after applying the fourth step and going back to step 2), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step.
We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code:
frequent_itemsets = {}
We also need to define the minimum support an itemset needs in order to be considered frequent. This value depends on the dataset, so try different values to see how they affect the result. I recommend changing it by only about 10 percent at a time, though, as even small changes can significantly alter the algorithm's running time! Let's set a minimum support value:
min_support = 50
To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test if the itemset is frequent. We use frozenset objects, as they allow us to perform faster set-based operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot).
Let's look at the following example of frozenset code:
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)
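The hashability requirement can be seen in a quick sketch (the movie IDs here are made up for illustration):

```python
# Toy sketch with hypothetical movie IDs: frozensets are hashable, so they
# can serve as dictionary keys; ordinary sets cannot.
counts = {}
counts[frozenset((1,))] = 10       # works: frozenset is hashable
try:
    counts[{1}] = 10               # fails: set is unhashable
except TypeError as e:
    print(e)                       # unhashable type: 'set'
print(counts[frozenset({1})])      # an equal frozenset finds the same key: 10
```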
We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests if they are frequent. First, we set up the function to perform these steps:
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])
In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small compared to the memory of an average computer), single-pass processing is a good practice to get into for larger applications.
Let's have a look at the core of this function in detail. We iterate through each user's set of favorable reviews, and through each of the previously discovered frequent itemsets, which are stored in k_1_itemsets (note that here, k_1 means k-1). For each pairing, we check whether the itemset is a subset of the current set of reviews; if it is, this means the user has reviewed every movie in the itemset. This check is done by the itemset.issubset(reviews) line.
We can then go through each individual movie that the user has reviewed (that is not already in the itemset), create a superset by combining the itemset with the new movie, and record that we saw this superset in our counting dictionary. These are the candidate frequent itemsets for this value of k.
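As a sketch of this superset step, again with made-up movie IDs rather than values from our dataset:

```python
# Hypothetical data: one frequent 2-itemset and one user's favorable reviews
itemset = frozenset((1, 2))
reviews = frozenset((1, 2, 5, 9))

supersets = []
if itemset.issubset(reviews):          # the user reviewed movies 1 and 2
    for other in reviews - itemset:    # the user's other movies: 5 and 9
        supersets.append(itemset | frozenset((other,)))

print(sorted(sorted(s) for s in supersets))  # [[1, 2, 5], [1, 2, 9]]
```

Each of these candidate 3-itemsets would have its count incremented once for this user.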
We end our function by testing which of the candidate itemsets have enough support to be considered frequent, returning only those whose support is at least our min_support value.
This function forms the heart of our Apriori implementation, and we now create a loop that iterates over the steps of the larger algorithm, storing the new itemsets as we increase k from 1 to a maximum value. In this loop, k represents the length of the soon-to-be-discovered frequent itemsets, allowing us to access the most recently discovered ones by looking in our frequent_itemsets dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code:
import sys

for k in range(2, 20):
    # Generate candidates of length k, using the frequent itemsets of length k-1
    # Only store the itemsets that are frequent
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1],
                                                   min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(
            len(cur_frequent_itemsets), k))
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
If we do find frequent itemsets, we print out a message to let us know the loop will run again. If we don't, we stop iterating: there cannot be frequent itemsets of length k+1 if there are none of length k, so the algorithm finishes.
We use sys.stdout.flush() to ensure that the printouts happen while the code is still running. Sometimes, particularly inside large loops, printouts will not appear until after the code has completed. Flushing the output ensures that the printout happens when we want it to, rather than when the interface decides it can allocate the time to print. Don't flush too frequently, though; the flush operation carries a computational cost (as does normal printing), and frequent flushing will slow down the program.
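A minimal sketch of the pattern (nothing here is specific to our dataset; the sleep stands in for real work):

```python
import sys
import time

for i in range(3):
    # Without flushing, buffered output may only appear after the loop ends
    print("Finished iteration {}".format(i))
    sys.stdout.flush()   # force the message out now, once per iteration
    time.sleep(0.1)      # stand-in for the real work done in each iteration
```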
You can now run the above code.
The preceding code returns about 2,000 frequent itemsets of varying lengths. You'll notice that the number of itemsets grows as the length increases before it shrinks. It grows because of the increasing number of possible combinations; after a while, most of these combinations no longer have the support necessary to be considered frequent, which causes the number to shrink. This shrinking is the benefit of the Apriori algorithm. If we searched all possible itemsets (not just the supersets of frequent ones), we would need to test thousands of times more itemsets for frequency.
Even if this shrinking didn't occur, the algorithm would reach an absolute end once it considered the itemset containing every movie. Therefore, the Apriori algorithm will always terminate.
It may take a few minutes for this code to run, more if you have older hardware. If you find you are having trouble running any of the code samples, take a look at using an online cloud provider for additional speed. Details about using the cloud to do the work are given in Appendix, Next Steps.