k-means Clustering
k-means clustering is a common unsupervised learning technique with a wide range of applications. It is popular because it is conceptually simple, scales well to large datasets, and tends to work well in practice. In the following section, you will learn the conceptual foundations of k-means clustering, how to apply k-means clustering to data, and how to deal with high-dimensional data (that is, data with many different variables) in the context of clustering.
Understanding k-means Clustering
k-means clustering is an algorithm that tries to find the best way of grouping data points into k different groups, where k is a parameter given to the algorithm. For now, we will choose k arbitrarily; we will revisit how to choose k in practice in the next chapter. The algorithm works iteratively in two steps:
- The algorithm begins by randomly selecting k points in space to be the centroids of the clusters. Each data point is then assigned to the centroid closest to it.
- The centroids are updated to be the mean of all of the data points assigned to them. The data points are then reassigned to the centroid closest to them.
Step two is repeated until no data point changes its assigned centroid after the centroids are updated.
One point to note here is that this algorithm is not deterministic; that is, its outcome depends on the starting locations of the centroids, so it is not guaranteed to find the best possible grouping. In practice, however, it tends to find good groupings while remaining computationally inexpensive even for large datasets. This combination of speed and scalability makes k-means one of the most commonly used clustering algorithms.
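To make the two steps concrete, here is a minimal sketch of the loop in plain NumPy. This is for illustration only; it is not how scikit-learn implements k-means internally (which uses smarter initialization and optimizations), and the function name and defaults here are our own:
import numpy as np

def simple_kmeans(X, k, max_iters=100, seed=0):
    """A bare-bones k-means loop. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once no point changes its assigned centroid
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids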
Note
In the next chapter, you will learn about how to evaluate how good your grouping is, and explore other alternative algorithms for clustering.
Exercise 12: k-means Clustering on Income/Age Data
In this exercise, you will first standardize the age and income data from the ageinc.csv dataset provided within the Lesson03 folder on the GitHub repository for this book, and perform k-means clustering using the scikit-learn package:
- Open your Jupyter Notebook and import the pandas package:
import pandas as pd
- Load the ageinc.csv dataset present within the Lesson03 folder:
ageinc_df = pd.read_csv('ageinc.csv')
- Create the standardized value columns for the income and age values and store them in the z_income and z_age columns, using the following snippet:
ageinc_df['z_income'] = (ageinc_df['income'] - ageinc_df['income'].mean())/ageinc_df['income'].std()
ageinc_df['z_age'] = (ageinc_df['age'] - ageinc_df['age'].mean())/ageinc_df['age'].std()
- Use Matplotlib to plot the data to get a sense of what it looks like. For this, you need to first import pyplot. To make sure the plot shows up in the Jupyter Notebook, we will tell the notebook to allow Matplotlib to plot inline. Note that this only has to be done once per notebook where we're plotting. Finally, we will use a scatterplot to plot the data:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(ageinc_df['income'], ageinc_df['age'])
- Label the axes as "Income" and "Age" and use the following code to display the figure:
plt.xlabel('Income')
plt.ylabel('Age')
plt.show()
The resulting figure should look like this:
Figure 3.13: A scatterplot of the age and income data
- Now use sklearn (scikit-learn), a package containing numerous machine learning algorithms, to perform k-means clustering using the standardized variables. Use the following snippet to perform k-means clustering with four clusters:
from sklearn import cluster
model = cluster.KMeans(n_clusters=4, random_state=10)
model.fit(ageinc_df[['z_income','z_age']])
In the preceding snippet, we first imported the cluster module from the sklearn package. Then, we defined the model to be a k-means algorithm with specific parameters (four clusters; the random state just ensures that everyone gets the same answer since the k-means algorithm is not deterministic). The final line fits the model to our data. We specifically only fit it to our z_income and z_age columns, since we don't want to use the unstandardized variables for clustering.
- Next, we will create a column called cluster that contains the label of the cluster each data point belongs to, and use the head function to inspect the first few rows. Consider the following snippet:
ageinc_df['cluster'] = model.labels_
ageinc_df.head()
Your output will appear as follows:
Figure 3.14: The first few rows of the data with the clusters each data point is assigned to
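If you also want to see where the learned centroids sit, the fitted model exposes them via model.cluster_centers_. They live in the standardized space, so here is a short sketch (using the same DataFrame and model from this exercise) that converts them back to the original units:
import pandas as pd

# The centroids are in z-score space; undo the standardization to interpret them
centers = pd.DataFrame(model.cluster_centers_, columns=['z_income', 'z_age'])
centers['income'] = centers['z_income']*ageinc_df['income'].std() + ageinc_df['income'].mean()
centers['age'] = centers['z_age']*ageinc_df['age'].std() + ageinc_df['age'].mean()
print(centers[['income', 'age']])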
- Finally, plot the data points, color- and shape-coded by the cluster they belong to. Use the unstandardized data for plotting so the variables are easier to interpret; since we already obtained the clustering from the standardized scores, this is just for visualization purposes, and the absolute values of the variables aren't important. We'll define the markers and colors we want to use for each cluster and then use a loop to plot the data points in each cluster separately with their respective color and shape. We then label the axes and display the figure:
colors = ['r', 'b', 'k', 'g']
markers = ['^', 'o', 'd', 's']
for c in ageinc_df['cluster'].unique():
    d = ageinc_df[ageinc_df['cluster'] == c]
    plt.scatter(d['income'], d['age'], marker=markers[c], color=colors[c])
plt.xlabel('Income')
plt.ylabel('Age')
plt.show()
The final plot you obtain should look as follows:
Figure 3.15: A plot of the data with the color/shape indicating which cluster each data point is assigned to
Congratulations! You've successfully performed k-means clustering using the scikit-learn package. In this exercise, we dealt with a dataset that had only two dimensions. In the next section, we'll take a look at how to deal with datasets containing more dimensions.
High-Dimensional Data
It's common to have data that has more than just two dimensions. For example, if in our age and income data we also had yearly spend, we would have three dimensions. If we had some information about how these customers responded to advertised sales, or how many purchases they had made of our products, or how many people lived in their household, we could have many more dimensions.
When we have additional dimensions, it becomes more difficult to visualize our data. In the previous exercise, we only had two variables, so we could easily visualize the data points and the clusters formed. With higher-dimensional data, however, different techniques need to be used, and dimensionality reduction techniques are commonly used for this. The idea of dimensionality reduction is that multi-dimensional data is reduced, usually to two dimensions for visualization purposes, while trying to preserve the distances between the points.
We will use principal component analysis (PCA) to perform dimensionality reduction. PCA is a method of transforming the data. It takes the original dimensions and creates new dimensions that capture the most variance in the data. In other words, it creates dimensions that contain the most amount of information about the data, so that when you take the first two principal components (dimensions), you are left with most of the information about the data, but reduced to only two dimensions:
Figure 3.16: How PCA works
Note
There are many other uses of PCA other than dimensionality reduction for visualization. You can read more about PCA here: https://towardsdatascience.com/principal-component-analysis-intro-61f236064b38.
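One useful property of PCA in scikit-learn is that a fitted PCA object reports how much of the original variance each component retains, via its explained_variance_ratio_ attribute. Here is a small sketch; the data is randomly generated purely to illustrate the API:
import numpy as np
from sklearn import decomposition

# Purely illustrative data: 200 points in 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

pca = decomposition.PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the original variance captured by each of the two components
print(pca.explained_variance_ratio_)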
Exercise 13: Dealing with High-Dimensional Data
In this exercise, we will deal with a dataset (three_col.csv) that has three columns. We will standardize the data and perform k-means clustering in a way that will scale to data with many columns. To visualize the data, we will perform dimensionality reduction using PCA:
- Open your Jupyter Notebook and import the pandas package:
import pandas as pd
- Read in the three_col.csv dataset present within the Lesson03 folder and inspect the columns:
df = pd.read_csv('three_col.csv')
df.head()
Figure 3.17: The first few rows of the data in the three_col.csv file
- Standardize the three columns and save the names of the standardized columns in a list, zcols. Use the following loop to standardize all of the columns instead of doing them one at a time:
cols = df.columns
zcols = []
for col in cols:
    df['z_' + col] = (df[col] - df[col].mean())/df[col].std()
    zcols.append('z_' + col)
- Inspect the new columns using the head command, as follows:
df.head()
Figure 3.18: The first few rows of the data with the standardized columns
- Perform k-means clustering on the standardized scores. For this, you will first need to import the cluster module from the sklearn package. Then, define a k-means clustering object (model, in the following snippet) with the random_state set to 10 and using four clusters. Finally, we will use the fit_predict function to fit our k-means model to the standardized columns in our data as well as to label the data:
from sklearn import cluster
model = cluster.KMeans(n_clusters=4, random_state=10)
df['cluster'] = model.fit_predict(df[zcols])
- Now we will perform PCA on our data. For this, you need to first import the decomposition module from sklearn, define a PCA object with n_components set to 2, use this PCA object to transform the standardized data, and store the transformed dimensions in pc1 and pc2:
from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
df['pc1'], df['pc2'] = zip(*pca.fit_transform(df[zcols]))
- Plot the clusters in the reduced dimensionality space, using the following loop to plot each cluster with its own shape and color:
import matplotlib.pyplot as plt
%matplotlib inline
colors = ['r', 'b', 'k', 'g']
markers = ['^', 'o', 'd', 's']
for c in df['cluster'].unique():
    d = df[df['cluster'] == c]
    plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
plt.show()
Figure 3.19: A plot of the data reduced to two dimensions denoting the various clusters
Note that the x and y axes here are principal components, and therefore are not easily interpretable. However, by visualizing the clusters, we can get a sense of how good the clusters are based on how much they overlap.
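If you do want a rough sense of what the axes represent, pca.components_ contains the weights each principal component places on the original standardized variables. A quick sketch, continuing from the objects defined in this exercise:
import pandas as pd

# Each row is a principal component; each column is one of the standardized inputs
weights = pd.DataFrame(pca.components_, columns=zcols, index=['pc1', 'pc2'])
print(weights)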
- To quickly investigate what each cluster seems to be capturing, we can look at the means of each of the variables in each cluster. Use the following snippet:
for cluster in df['cluster'].unique():
    print("Cluster: " + str(cluster))
    for col in ['income', 'age', 'days_since_purchase']:
        print(col + ": {:.2f}".format(df.loc[df['cluster'] == cluster, col].mean()))
Here is a tabular representation of the output (notice the difference in the means between the four clusters):
Figure 3.20: The means of the three columns
Note
This is just one example of how to investigate the different clusters. You can also look at the first few examples of data points in each to get a sense of the differences. In more complex cases, using various visualization techniques to probe more deeply into the different clusters may be useful.
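For instance, pandas can produce the same per-cluster summary in a single line with groupby; a compact alternative to the loop above, assuming the same df:
print(df.groupby('cluster')[['income', 'age', 'days_since_purchase']].mean())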
Congratulations! You have successfully used PCA for dimensionality reduction. We can see that each cluster has different characteristics: cluster 0 represents customers with high incomes, low ages, and relatively fewer days since the last purchase; cluster 1 represents low income, low age, and more days since the last purchase; cluster 2 represents low income, high age, and fewer days since last purchase; and cluster 3 has high income, high age, and more days since last purchase.
Activity 4: Using k-means Clustering on Customer Behavior Data
Imagine that you work for the marketing department of a company that sells different types of wine to customers. Your marketing team launched 32 initiatives over the past year to increase the sales of wine (the data for these is present in the offer_info.csv file in the Lesson03 folder). Your team has also acquired data that tells you which customers have responded to which of the 32 marketing initiatives recently (this data is present within the customer_offers.csv file). Your marketing team now wants to begin targeting its initiatives more precisely, so it can provide offers customized to groups that tend to respond to similar offers.
Note
Some knowledge of wine might be useful for drawing the inferences at the end of this activity. Feel free to Google the wine types to get an idea of what they are.
Your task is to use k-means clustering to discover a few groups of customers and explore what those groupings are and the types of offers that customers in those groups tend to respond to. Execute the following steps to complete this activity:
- Read in the data in the customer_offers.csv file and set the customer_name column to the index.
- Perform k-means clustering with three clusters and save the cluster that each data point is assigned to.
Note
We won't standardize the data this time, because all variables are binary. We will talk more about other variable types in the next chapter.
- Use PCA to visualize the clusters. Your plot will look as follows:
Figure 3.21: A plot of the data reduced to two dimensions denoting three clusters
- Investigate how each cluster differs from the average in each of our features. In other words, find the difference between the proportion of customers in each cluster that responded to an offer and the proportion of customers overall that responded to an offer, for each of the offers. Plot these differences on a bar chart. The outputs should appear as follows:
Figure 3.22: Plot for cluster 0
Figure 3.23: Plot for cluster 1
Figure 3.24: Plot for cluster 2
- Load the information about what the offers were from offer_info.csv. For each cluster, find the five offers where the data points in that cluster differ most from the mean, and print out the varietal of those offers. You should get the following values:
Figure 3.25: The five offers where the cluster differs most from the mean
Note
The solution for this activity can be found on page 334.