Introduction to k-means Clustering with Built-In Functions
In this section, we're going to use built-in R libraries to perform k-means clustering instead of writing custom code, which is lengthy and prone to bugs. Using pre-built libraries instead of writing our own code has other advantages, too:
- Library functions are computationally efficient, as thousands of hours have gone into their development.
- Library functions are almost bug-free, as they've been tested by thousands of users across a wide range of practical scenarios.
- Using libraries saves time, as you don't have to write and debug your own code.
k-means Clustering with Three Clusters
In the previous activity, we performed k-means clustering with three clusters by writing our own code. In this section, we're going to achieve a similar result with the help of pre-built R libraries.
First, we're going to start with the distribution of the three species of flowers in our dataset, as represented in the following graph:
Figure 1.17: A graph representing three species of iris in three colors
In the preceding plot, setosa is represented in blue, virginica in gray, and versicolor in pink.
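The code that produced this figure isn't part of the exercise, but if you'd like to reproduce something similar yourself, the following sketch uses base R plotting; the exact colors and styling of the original figure are assumptions based on the description above:
# Sketch: plot sepal width against sepal length, colored by species.
# The color choices below simply mimic the description of Figure 1.17.
species_colors <- c(setosa = "blue", virginica = "gray", versicolor = "pink")
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = species_colors[as.character(iris$Species)],
     pch = 19, xlab = "Sepal Length", ylab = "Sepal Width")
legend("topright", legend = names(species_colors), col = species_colors, pch = 19)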
With this dataset, we're going to perform k-means clustering and see whether the built-in algorithm is able to find a pattern on its own to classify these three species of iris using their sepal sizes. This time, we're going to use just four lines of code.
Exercise 3: k-means Clustering with R Libraries
In this exercise, we're going to learn to do k-means clustering in a much easier way with the pre-built libraries of R. By completing this exercise, you will be able to divide the three species of Iris into three separate clusters:
- We put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:
iris_data <- iris[, 1:2]
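If you want to confirm which columns were selected, an optional check such as the following shows that iris_data contains the Sepal.Length and Sepal.Width columns:
head(iris_data)  # first few rows: Sepal.Length and Sepal.Width
str(iris_data)   # 150 observations of 2 numeric variables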
- We find the k-means cluster centers and the cluster to which each point belongs, and store it all in the km.res variable. Here, in the kmeans() function, we enter the dataset as the first parameter and the number of clusters we want as the second parameter:
km.res <- kmeans(iris_data, 3)
Note
The kmeans function has many input parameters, which can be altered to get different final outputs. You can find out more about them in the documentation at https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/kmeans.
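kmeans starts from randomly chosen centers, so repeated runs can assign slightly different cluster numbers. If you want a reproducible result, you can optionally set a random seed before the call (the seed value below is arbitrary), and then inspect the components of the returned object:
set.seed(123)                  # optional: make the random initialization reproducible
km.res <- kmeans(iris_data, 3)
km.res$centers                 # coordinates of the three cluster centers
km.res$cluster                 # the cluster assigned to each of the 150 observations
km.res$size                    # how many observations fell into each cluster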
- Install the factoextra library as follows:
install.packages('factoextra')
- We load the factoextra library to visualize the clusters we just created. factoextra is an R package used for plotting multivariate data:
library("factoextra")
- Generate the plot of the clusters. Here, we need to enter the results of k-means as the first parameter. In data, we enter the data on which clustering was done. In palette, we're selecting the color palette for the clusters, and in ggtheme, we're selecting the theme of the output plot:
fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())
The output will be as follows:
Figure 1.18: Three species of Iris have been clustered into three clusters
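fviz_cluster accepts further arguments if you want to adjust the plot. For example, the geom argument controls whether points, labels, or both are drawn; the variant below is just a sketch showing how to plot points without the row labels:
fviz_cluster(km.res, data = iris_data, geom = "point", palette = "jco", ggtheme = theme_minimal())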
Here, if you compare Figure 1.18 to Figure 1.17, you will see that we have classified all three species almost correctly. The clusters we've generated don't exactly match the species shown in Figure 1.17, but we've come very close considering that we're only using sepal length and width to classify them.
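One optional way to quantify how close we came is to cross-tabulate the cluster assignments against the true species. Keep in mind that the cluster numbers 1, 2, and 3 are arbitrary labels, so what matters is that each species is concentrated in a single cluster:
table(km.res$cluster, iris$Species)  # rows: clusters, columns: species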
You can see from this example that clustering would've been a very useful way of categorizing the irises if we didn't already know their species. You will come across many examples of datasets where you don't have labeled categories, but are able to use clustering to form your own groupings.