Chapter 2. Supervised Learning
In supervised learning, we have training data where each instance has an input (a set of attributes) and a desired output (a target class). We then use this data to train a model that will predict the same target class for new, unseen instances.
Supervised learning methods are nowadays a standard tool in a wide range of disciplines, from medical diagnosis to natural language processing, image recognition, and searching for new particles at the Large Hadron Collider (LHC). In this chapter we will present several methods applied to real-world examples, using some of the many algorithms implemented in scikit-learn. This chapter is not intended to substitute for the scikit-learn reference; rather, it introduces the main supervised learning techniques and shows how they can be used to solve practical problems.
Image recognition with Support Vector Machines
Imagine that the instances in your dataset are points in a multidimensional space; we can assume that the model built by our classifier is a surface, or, in linear algebra terminology, a hyperplane, that separates the instances (points) of one class from the rest. Support Vector Machines (SVM) are supervised learning methods that try to obtain these hyperplanes in an optimal way, by selecting the ones that pass through the widest possible gaps between instances of different classes. New instances will be classified as belonging to a certain category based on which side of the surface they fall on.
The following figure shows an example for a two-dimensional space with two features (X1 and X2) and two classes (black and white):
We can observe that the green hyperplane does not separate both classes, committing some classification errors. The blue and the red hyperplanes separate both classes without errors. However, the red surface separates both classes with maximum margin; it is the hyperplane that is farthest from the closest instances of the two categories. The main advantage of this approach is that it will probably lower the generalization error, making this model resistant to overfitting, something that has been empirically verified in many different classification tasks.
This approach can be generalized to construct hyperplanes not only in two dimensions, but also in high or infinite dimensional spaces. What is more, we can use nonlinear decision surfaces, such as polynomial or radial basis functions, by using the so-called kernel trick, which implicitly maps inputs into high-dimensional feature spaces.
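To get a rough intuition of what the kernel trick buys us, the following small snippet (not part of this chapter's running example; it uses a toy dataset built with scikit-learn's make_circles utility, which is otherwise not used in this book) fits a linear and an RBF kernel SVC on data that is not linearly separable:

>>> from sklearn.svm import SVC
>>> from sklearn.datasets import make_circles
>>> # toy data: one class forms an inner circle, the other an outer ring
>>> X, y = make_circles(noise=0.1, factor=0.4, random_state=0)
>>> # a linear hyperplane cannot separate the two rings...
>>> print SVC(kernel='linear').fit(X, y).score(X, y)
>>> # ...while the RBF kernel implicitly maps the points into a space where it can
>>> print SVC(kernel='rbf').fit(X, y).score(X, y)

On data like this, the linear kernel should score close to chance, while the RBF kernel should score much higher.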
SVMs have become one of the state-of-the-art machine learning models for many tasks, with excellent results in many practical applications. One of their greatest advantages is that they are very effective when working in high-dimensional spaces, that is, on problems that have a lot of features to learn from. They are also very effective when the data is sparse (think about a high-dimensional space with very few instances). Besides, they are very efficient in terms of memory storage, since only a subset of the points in the learning space is used to represent the decision surfaces.
To mention some disadvantages, SVM models can be very computation intensive to train, and they do not directly return a numerical indicator of how confident they are about a prediction. However, we can obtain such confidence estimates with additional techniques that internally rely on cross-validation, at the cost of increasing the computational cost.
We will apply SVM to image recognition, a classic problem with a very large dimensional space (the value of each pixel of the image is considered as a feature). What we will try to do is, given an image of a person's face, predict which of the people on a list it belongs to (this kind of approach is used, for example, in social network applications to automatically tag people within photographs). Our learning set will be a group of labeled images of people's faces, and we will try to learn a model that can predict the label of unseen instances. The intuitive first approach is to use the image pixels as features for the learning algorithm, so pixel values will be our learning attributes and the individual's label will be our target class.
Our dataset is provided within scikit-learn, so let's start by importing and printing its description.
>>> import sklearn as sk
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import fetch_olivetti_faces
>>> faces = fetch_olivetti_faces()
>>> print faces.DESCR
The dataset contains 400 images of 40 different persons. The photos were taken with different light conditions and facial expressions (including open/closed eyes, smiling/not smiling, and with glasses/no glasses). For additional information about the dataset refer to http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
Looking at the content of the faces object, we get the following properties: images, data, and target. images contains the 400 images represented as 64 x 64 pixel matrices. data contains the same 400 images, but as arrays of 4096 pixels each. target is, as expected, an array with the target classes, ranging from 0 to 39.
>>> print faces.keys()
['images', 'data', 'target', 'DESCR']
>>> print faces.images.shape
(400, 64, 64)
>>> print faces.data.shape
(400, 4096)
>>> print faces.target.shape
(400,)
As we saw in the previous chapter, normalizing the data is important, and it is also necessary for SVM to obtain good results. In our particular case, we can verify by running the following snippet that our images already come as values in a very uniform range between 0 and 1 (pixel values):
>>> print np.max(faces.data)
1.0
>>> print np.min(faces.data)
0.0
>>> print np.mean(faces.data)
0.547046432495
Therefore, we do not have to normalize the data. Before learning, let's plot some faces. We will define the following helper function:
>>> def print_faces(images, target, top_n):
>>>     # set up the figure size in inches
>>>     fig = plt.figure(figsize=(12, 12))
>>>     fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
>>>     for i in range(top_n):
>>>         # plot the images in a matrix of 20x20
>>>         p = fig.add_subplot(20, 20, i + 1, xticks=[], yticks=[])
>>>         p.imshow(images[i], cmap=plt.cm.bone)
>>>         # label the image with the target value
>>>         p.text(0, 14, str(target[i]))
>>>         p.text(0, 60, str(i))
If we print the first 20 images, we can see faces from two persons.
>>> print_faces(faces.images, faces.target, 20)
Training a Support Vector Machine
To use SVM in scikit-learn to solve our task, we will import the SVC class from the sklearn.svm module:
>>> from sklearn.svm import SVC
The Support Vector Classifier (SVC) will be used for classification. In the last section of this chapter, we will use SVM for regression tasks.
The SVC implementation has several important parameters; probably the most relevant is kernel, which defines the kernel function to be used in our classifier (think of the kernel functions as different similarity measures between instances). By default, the SVC class uses the rbf kernel, which allows us to model nonlinear problems. To start, we will use the simplest kernel, the linear one.
>>> svc_1 = SVC(kernel='linear')
Before continuing, we will split our dataset into training and testing datasets.
>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
>>>     faces.data, faces.target, test_size=0.25, random_state=0)
We will also define a function to evaluate a classifier with K-fold cross-validation.
>>> from sklearn.cross_validation import cross_val_score, KFold
>>> from scipy.stats import sem
>>>
>>> def evaluate_cross_validation(clf, X, y, K):
>>>     # create a k-fold cross-validation iterator
>>>     cv = KFold(len(y), K, shuffle=True, random_state=0)
>>>     # by default the score used is the one returned by the score method of the estimator (accuracy)
>>>     scores = cross_val_score(clf, X, y, cv=cv)
>>>     print scores
>>>     print ("Mean score: {0:.3f} (+/-{1:.3f})").format(
>>>         np.mean(scores), sem(scores))

>>> evaluate_cross_validation(svc_1, X_train, y_train, 5)
[ 0.93333333  0.91666667  0.95        0.95        0.91666667]
Mean score: 0.933 (+/-0.007)
Cross-validation with five folds obtains pretty good results (a mean accuracy of 0.933). In just a few steps we obtained a working face classifier.
We will also define a function to perform training on the training set and evaluate the performance on the testing set.
>>> from sklearn import metrics
>>>
>>> def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
>>>     clf.fit(X_train, y_train)
>>>     print "Accuracy on training set:"
>>>     print clf.score(X_train, y_train)
>>>     print "Accuracy on testing set:"
>>>     print clf.score(X_test, y_test)
>>>     y_pred = clf.predict(X_test)
>>>     print "Classification Report:"
>>>     print metrics.classification_report(y_test, y_pred)
>>>     print "Confusion Matrix:"
>>>     print metrics.confusion_matrix(y_test, y_pred)
If we train and evaluate, we see that the classifier makes almost no errors.
>>> train_and_evaluate(svc_1, X_train, X_test, y_train, y_test)
Accuracy on training set:
1.0
Accuracy on testing set:
0.99
Let's go a little further: why don't we try to classify the faces as people with and without glasses?
The first thing to do is to define the ranges of the images that show faces wearing glasses. The following list shows the indexes of these images:
>>> # the index ranges of images of people with glasses
>>> glasses = [
>>>     (10, 19), (30, 32), (37, 38), (50, 59), (63, 64),
>>>     (69, 69), (120, 121), (124, 129), (130, 139), (160, 161),
>>>     (164, 169), (180, 182), (185, 185), (189, 189), (190, 192),
>>>     (194, 194), (196, 199), (260, 269), (270, 279), (300, 309),
>>>     (330, 339), (358, 359), (360, 369)
>>> ]
You can check these values by using the print_faces function that was defined before to plot the 400 faces and looking at the indexes in the lower-left corners.
Then we'll define a function that, from those segments, returns a new target array marking faces with glasses as 1 and faces without glasses as 0 (our new target classes):
>>> def create_target(segments):
>>>     # create a new y array of target size initialized with zeros
>>>     y = np.zeros(faces.target.shape[0])
>>>     # put 1 in the specified segments
>>>     for (start, end) in segments:
>>>         y[start:end + 1] = 1
>>>     return y

>>> target_glasses = create_target(glasses)
So we must perform the training/testing split again.
>>> X_train, X_test, y_train, y_test = train_test_split(
>>>     faces.data, target_glasses, test_size=0.25, random_state=0)
Now let's create a new SVC classifier, and train it with the new target vector using the following command:
>>> svc_2 = SVC(kernel='linear')
Let's check the performance with cross-validation:
>>> evaluate_cross_validation(svc_2, X_train, y_train, 5)
[ 0.98333333  0.98333333  0.93333333  0.96666667  0.96666667]
Mean score: 0.967 (+/-0.009)
We obtain a mean accuracy of 0.967 with cross-validation. Now let's train and evaluate on our testing set.
>>> train_and_evaluate(svc_2, X_train, X_test, y_train, y_test)
Accuracy on training set:
1.0
Accuracy on testing set:
0.99
Classification Report:
             precision    recall  f1-score   support

          0       1.00      0.99      0.99        67
          1       0.97      1.00      0.99        33

avg / total       0.99      0.99      0.99       100

Confusion Matrix:
[[66  1]
 [ 0 33]]
Could it be that our classifier has simply learned to identify which people usually wear glasses and which do not, rather than learning glasses-related features? How can we be sure that this is not happening, and that it will work as expected on new, unseen faces? Let's hold out all the images of one person who sometimes wears glasses and sometimes does not (the images with indexes from 30 to 39), train using the remaining instances, and evaluate on this new 10-instance set. With this experiment we try to rule out the possibility that the model is remembering faces rather than glasses-related features.
>>> X_test = faces.data[30:40]
>>> y_test = target_glasses[30:40]
>>> print y_test.shape[0]
10
>>> select = np.ones(target_glasses.shape[0])
>>> select[30:40] = 0
>>> X_train = faces.data[select == 1]
>>> y_train = target_glasses[select == 1]
>>> print y_train.shape[0]
390
>>> svc_3 = SVC(kernel='linear')
>>> train_and_evaluate(svc_3, X_train, X_test, y_train, y_test)
Accuracy on training set:
1.0
Accuracy on testing set:
0.9
Classification Report:
             precision    recall  f1-score   support

          0       0.83      1.00      0.91         5
          1       1.00      0.80      0.89         5

avg / total       0.92      0.90      0.90        10

Confusion Matrix:
[[5 0]
 [1 4]]
Out of the 10 images we got only one error, which is still a pretty good result. Let's check which image was incorrectly classified. First, we have to reshape the data from arrays back to 64 x 64 matrices:
>>> y_pred = svc_3.predict(X_test)
>>> eval_faces = [np.reshape(a, (64, 64)) for a in X_test]
Then plot with our print_faces function:
>>> print_faces(eval_faces, y_pred, 10)
The image number 8 in the preceding figure has glasses and was classified as no glasses. If we look at that instance, we can see that it is different from the rest of the images with glasses (the border of the glasses cannot be seen clearly and the person is shown with closed eyes), which could be the reason it has been misclassified.
With a few lines, we created a face classifier with a linear SVM model. Usually we would not get such good results on the first trial. In those cases (besides looking at different features), we can start tweaking the hyperparameters of our algorithm. In the particular case of SVM, we can try different kernel functions; if linear does not give good results, we can try polynomial or RBF kernels. Also, the C and gamma parameters may affect the results. For a description of these arguments and their values, please refer to the scikit-learn documentation.
Text classification with Naïve Bayes
Naïve Bayes is a simple but powerful classifier based on a probabilistic model derived from Bayes' theorem. Basically, it determines the probability that an instance belongs to a class based on each of the feature value probabilities. The term naïve comes from the fact that it assumes that each feature is independent of the rest, that is, the value of a feature has no relation to the value of another feature.
Despite being very simple, it has been used in many domains with very good results. The independence assumption, although a naïve and strong simplification, is one of the features that make the model useful in practical applications. Training the model reduces to calculating the involved conditional probabilities, which can be estimated by counting the frequencies of co-occurrence of feature values and class values in the training data.
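To make the idea of counting frequencies concrete, here is a tiny sketch (with made-up data, not the newsgroup dataset we will use next) of how the involved probabilities can be estimated:

>>> import numpy as np
>>> # toy data: one binary feature ("the message contains the word 'free'")
>>> # and a binary target class (spam or not)
>>> contains_free = np.array([1, 1, 0, 1, 0, 0, 0, 1])
>>> is_spam = np.array([1, 1, 0, 1, 0, 0, 1, 0])
>>> # P(spam) is simply the class frequency in the training data
>>> p_spam = np.mean(is_spam)
>>> # P(contains 'free' | spam) is a frequency count restricted to the spam messages
>>> p_free_given_spam = np.mean(contains_free[is_spam == 1])
>>> print p_spam, p_free_given_spam

The actual MultinomialNB implementation we will use later also applies smoothing to these counts, but the core of the training process is this kind of frequency estimation.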
One of the most successful applications of Naïve Bayes has been within the field of Natural Language Processing (NLP). NLP is a field closely related to machine learning, since many of its problems can be formulated as classification tasks. Usually, NLP problems have large amounts of tagged data in the form of text documents. This data can be used as a training dataset for machine learning algorithms.
In this section, we will use Naïve Bayes for text classification; we will have a set of text documents with their corresponding categories, and we will train a Naïve Bayes algorithm to learn to predict the categories of new unseen instances. This simple task has many practical applications; probably the most known and widely used one is spam filtering. In this section we will try to classify newsgroup messages using a dataset that can be retrieved from within scikit-learn. This dataset consists of around 19,000 newsgroup messages from 20 different topics ranging from politics and religion to sports and science.
As usual, we first start by importing our pylab environment:
>>> %pylab inline
Our dataset can be obtained by importing the fetch_20newsgroups function from the sklearn.datasets module. We have to specify if we want to import a part of, or all of, the set of instances (we will import all of them).
>>> from sklearn.datasets import fetch_20newsgroups
>>> news = fetch_20newsgroups(subset='all')
If we look at the properties of the dataset, we will find that we have the usual ones: DESCR, data, target, and target_names. The difference now is that data holds a list of text contents, instead of a numpy matrix:
>>> print type(news.data), type(news.target), type(news.target_names)
<type 'list'> <type 'numpy.ndarray'> <type 'list'>
>>> print news.target_names
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
>>> print len(news.data)
18846
>>> print len(news.target)
18846
If you look at, say, the first instance, you will see the content of a newsgroup message, and you can get its corresponding category:
>>> print news.data[0]
>>> print news.target[0], news.target_names[news.target[0]]
Preprocessing the data
Our machine learning algorithms can work only on numeric data, so our next step will be to convert our text-based dataset to a numeric dataset. Currently we only have one feature, the text content of the message; we need some function that transforms a text into a meaningful set of numeric features. Intuitively, one could look at which words (or, more precisely, tokens, including numbers or punctuation signs) are used in each of the text categories, and try to characterize each category with the frequency distribution of those words. The sklearn.feature_extraction.text module has some useful utilities to build numeric feature vectors from text documents.
Before starting the transformation, we will have to partition our data into training and testing sets. The loaded data is already in a random order, so we only have to split the data into, for example, 75 percent for training and the remaining 25 percent for testing:
>>> SPLIT_PERC = 0.75
>>> split_size = int(len(news.data)*SPLIT_PERC)
>>> X_train = news.data[:split_size]
>>> X_test = news.data[split_size:]
>>> y_train = news.target[:split_size]
>>> y_test = news.target[split_size:]
If you look inside the sklearn.feature_extraction.text module, you will find three different classes that can transform text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. The difference between them resides in the calculations they perform to obtain the numeric features. CountVectorizer basically creates a dictionary of words from the text corpus. Then, each instance is converted to a vector of numeric features where each element is the count of the number of times a particular word appears in the document.
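As a quick illustration (on two made-up sentences, not our newsgroup data), this is how CountVectorizer can be used; the feature names and counts described in the comments are what we would expect from the default settings:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> counts = vectorizer.fit_transform(['the cat sat', 'the cat sat on the mat'])
>>> # the learned dictionary, in alphabetical order: ['cat', 'mat', 'on', 'sat', 'the']
>>> print vectorizer.get_feature_names()
>>> # one row per document, one column per word; the second document contains 'the' twice
>>> print counts.toarray()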
HashingVectorizer, instead of constructing and maintaining the dictionary in memory, implements a hashing function that maps tokens into feature indexes, and then computes the counts as in CountVectorizer.
TfidfVectorizer works like CountVectorizer, but with a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF). This is a statistic for measuring the importance of a word in a document or corpus. Intuitively, it looks for words that are more frequent in the current document compared with their frequency in the whole corpus of documents. You can see this as a way to normalize the results and downweight words that are too frequent, and thus not useful to characterize the instances.
Training a Naïve Bayes classifier
We will create a Naïve Bayes classifier that is composed of a feature vectorizer and the actual Bayes classifier. We will use the MultinomialNB class from the sklearn.naive_bayes module. In order to compose the classifier with the vectorizer, scikit-learn has a very useful class called Pipeline (available in the sklearn.pipeline module) that eases the construction of a compound classifier, which consists of several vectorizers and classifiers.
We will create three different classifiers by combining MultinomialNB with the three different text vectorizers just mentioned, and compare which one performs better using the default parameters:
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
>>> clf_1 = Pipeline([
>>>     ('vect', CountVectorizer()),
>>>     ('clf', MultinomialNB()),
>>> ])
>>> clf_2 = Pipeline([
>>>     ('vect', HashingVectorizer(non_negative=True)),
>>>     ('clf', MultinomialNB()),
>>> ])
>>> clf_3 = Pipeline([
>>>     ('vect', TfidfVectorizer()),
>>>     ('clf', MultinomialNB()),
>>> ])
We will define a function that takes a classifier and performs K-fold cross-validation over the specified X and y values:
>>> from sklearn.cross_validation import cross_val_score, KFold
>>> from scipy.stats import sem
>>>
>>> def evaluate_cross_validation(clf, X, y, K):
>>>     # create a k-fold cross validation iterator of k=5 folds
>>>     cv = KFold(len(y), K, shuffle=True, random_state=0)
>>>     # by default the score used is the one returned by the score method of the estimator (accuracy)
>>>     scores = cross_val_score(clf, X, y, cv=cv)
>>>     print scores
>>>     print ("Mean score: {0:.3f} (+/-{1:.3f})").format(
>>>         np.mean(scores), sem(scores))
Then we will perform a five-fold cross-validation by using each one of the classifiers.
>>> clfs = [clf_1, clf_2, clf_3]
>>> for clf in clfs:
>>>     evaluate_cross_validation(clf, news.data, news.target, 5)
These calculations may take some time; the results are as follows:
[ 0.86813478  0.86415495  0.86893075  0.85831786  0.8729443 ]
Mean score: 0.866 (+/-0.002)
[ 0.76359777  0.77182276  0.77765986  0.76147519  0.78222812]
Mean score: 0.771 (+/-0.004)
[ 0.86282834  0.85195012  0.86282834  0.85619528  0.87612732]
Mean score: 0.862 (+/-0.004)
As you can see, CountVectorizer and TfidfVectorizer had similar performance, much better than HashingVectorizer.
Let's continue with TfidfVectorizer; we could try to improve the results by parsing the text documents into tokens with a different regular expression.
>>> clf_4 = Pipeline([
>>>     ('vect', TfidfVectorizer(
>>>         token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
>>>     )),
>>>     ('clf', MultinomialNB()),
>>> ])
The default regular expression, ur"\b\w\w+\b", considers alphanumeric characters and the underscore. Perhaps also considering the hyphen and the dot could improve the tokenization, so that tokens such as Wi-Fi and site.com are kept whole. The new regular expression could be: ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b". If you have questions about how to define regular expressions, please refer to the Python re module documentation. Let's try our new classifier:
>>> evaluate_cross_validation(clf_4, news.data, news.target, 5)
[ 0.87078801  0.86309366  0.87689042  0.86574688  0.8795756 ]
Mean score: 0.871 (+/-0.003)
We have a slight improvement from 0.86 to 0.87.
Another parameter that we can use is stop_words: this argument allows us to pass a list of words we do not want to take into account, such as words that are too frequent, or words we do not a priori expect to provide information about the particular topic.
We will define a function to load the stop words from a text file as follows:
>>> def get_stop_words():
>>>     result = set()
>>>     for line in open('stopwords_en.txt', 'r').readlines():
>>>         result.add(line.strip())
>>>     return result
And create a new classifier with this new parameter as follows:
>>> clf_5 = Pipeline([
>>>     ('vect', TfidfVectorizer(
>>>         stop_words=get_stop_words(),
>>>         token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
>>>     )),
>>>     ('clf', MultinomialNB()),
>>> ])
>>> evaluate_cross_validation(clf_5, news.data, news.target, 5)
[ 0.88989122  0.8837888   0.89042186  0.88325816  0.89655172]
Mean score: 0.889 (+/-0.002)
The preceding code shows another improvement from 0.87 to 0.89.
Let's keep this vectorizer and start looking at the MultinomialNB parameters. This classifier has few parameters to tweak; the most important is the alpha parameter, which is a smoothing parameter. Let's set it to a lower value: instead of the default value of 1.0, we will use 0.01:
>>> clf_7 = Pipeline([
>>>     ('vect', TfidfVectorizer(
>>>         stop_words=get_stop_words(),
>>>         token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
>>>     )),
>>>     ('clf', MultinomialNB(alpha=0.01)),
>>> ])
>>> evaluate_cross_validation(clf_7, news.data, news.target, 5)
[ 0.92305651  0.91377023  0.92066861  0.91907668  0.92281167]
Mean score: 0.920 (+/-0.002)
The results got an important boost, from 0.89 to 0.92; pretty good. At this point, we could continue doing trials with different values of alpha or with new modifications of the vectorizer, and keep the combination that performs best. But for now, let's look a little more at our Naïve Bayes model.
Evaluating the performance
If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set.
We will define a helper function that will train the model in the entire training set and evaluate the accuracy in the training and in the testing sets. It will also print a classification report (precision and recall on every class) and the corresponding confusion matrix:
>>> from sklearn import metrics
>>>
>>> def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
>>>     clf.fit(X_train, y_train)
>>>     print "Accuracy on training set:"
>>>     print clf.score(X_train, y_train)
>>>     print "Accuracy on testing set:"
>>>     print clf.score(X_test, y_test)
>>>     y_pred = clf.predict(X_test)
>>>     print "Classification Report:"
>>>     print metrics.classification_report(y_test, y_pred)
>>>     print "Confusion Matrix:"
>>>     print metrics.confusion_matrix(y_test, y_pred)
We will evaluate our best classifier.
>>> train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)
Accuracy on training set:
0.99398613273
Accuracy on testing set:
0.913837011885
As we can see, we obtained very good results, and as we would expect, the accuracy on the training set is noticeably better than on the testing set. We may expect an accuracy of around 0.91 on new, unseen instances.
If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:
>>> print len(clf_7.named_steps['vect'].get_feature_names())
61236
This shows that the dictionary is composed of 61236 tokens. Let's print the feature names.
>>> clf_7.named_steps['vect'].get_feature_names()
The following table presents an extract of the results:
You can see that some words are semantically very similar, for example, sand and sands, sanctuaries and sanctuary. Perhaps if plurals and singulars were counted in the same bucket, we would represent the documents better. This is a very common task, which can be solved using stemming, a technique that relates two words having the same lexical root.
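As a small sketch of this idea (it assumes the NLTK library is installed; NLTK is not used anywhere else in this chapter), we could plug a stemmer into the vectorizer through a custom analyzer:

>>> from nltk.stem import PorterStemmer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> stemmer = PorterStemmer()
>>> # reuse the default preprocessing and tokenization, then stem every token
>>> analyzer = TfidfVectorizer().build_analyzer()
>>> def stemmed_analyzer(doc):
>>>     return [stemmer.stem(token) for token in analyzer(doc)]
>>> stem_vect = TfidfVectorizer(analyzer=stemmed_analyzer)
>>> # 'sanctuary'/'sanctuaries' and 'sand'/'sands' should collapse into the same stems
>>> print stem_vect.fit(['sanctuary sanctuaries sand sands']).get_feature_names()

Whether stemming actually improves the classifier would have to be checked with cross-validation, as we did for the other vectorizer changes.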
Explaining Titanic hypothesis with decision trees
A common argument against linear classifiers and statistical learning methods in general is that it is difficult to explain how the built model decides its predictions for the target classes. If you have a high-dimensional SVM, it is impossible for a human being to even imagine what the constructed hyperplane looks like. A Naïve Bayes classifier will tell you something like: "this class is the most probable, assuming it comes from a distribution similar to that of the training data, plus a few more assumptions", which is not very useful when, for example, we want to know why this or that mail should be considered spam.
Decision trees are very simple yet powerful supervised learning methods; they construct a decision tree model, which is then used to make predictions. The following figure shows a very simple decision tree to decide if an e-mail should be considered spam:
It first asks if the e-mail contains the word Viagra; if the answer is yes, it classifies it as spam; if the answer is no, it further asks if it comes from somebody in your contacts list. This time, if the answer is yes, it classifies the e-mail as ham, and if the answer is no, it classifies it as spam. The main advantage of this model is that a human being can easily understand and reproduce the sequence of decisions (especially if the number of attributes is small) taken to predict the target class of a new instance. This is very important for tasks such as medical diagnosis or credit approval, where we want to show a reason for the decision, rather than just saying this is what the training data suggests (which is, by definition, what every supervised learning method does). In this section, we will show, through a working example, what decision trees look like, how they are built, and how they are used for prediction.
The problem we would like to solve is to determine whether a Titanic passenger would have survived, given her age, passenger class, and sex. We will work with the Titanic dataset, which can be downloaded from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt. Like every other example in this chapter, we start with a dataset that includes the list of Titanic passengers and a feature indicating whether each of them survived. Each instance in the dataset has the following form:
"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","St Louis, MO","B-5","24160 L221","2","female"
The list of attributes is: Ordinal, Class, Survived (0=no, 1=yes), Name, Age, Port of Embarkation, Home/Destination, Room, Ticket, Boat, and Sex. We will start by loading the dataset into a numpy array.
>>> import csv
>>> import numpy as np
>>> with open('data/titanic.csv', 'rb') as csvfile:
>>>     titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
>>>
>>>     # Header contains feature names
>>>     row = titanic_reader.next()
>>>     feature_names = np.array(row)
>>>
>>>     # Load dataset, and target classes
>>>     titanic_X, titanic_y = [], []
>>>     for row in titanic_reader:
>>>         titanic_X.append(row)
>>>         titanic_y.append(row[2])  # The target value is "survived"
>>>
>>>     titanic_X = np.array(titanic_X)
>>>     titanic_y = np.array(titanic_y)
The code shown uses the Python csv module to load the data.
>>> print feature_names
['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest' 'room' 'ticket' 'boat' 'sex']
>>> print titanic_X[0], titanic_y[0]
['1' '1st' '1' 'Allen, Miss Elisabeth Walton' '29.0000' 'Southampton' 'St Louis, MO' 'B-5' '24160 L221' '2' 'female'] 1
Preprocessing the data
The first step we must take is to select the attributes we will use for learning:
>>> # we keep class, age and sex
>>> titanic_X = titanic_X[:, [1, 4, 10]]
>>> feature_names = feature_names[[1, 4, 10]]
We have selected features number 1, 4, and 10, that is, class, age, and sex, based on the assumption that the remaining attributes have no effect on the passenger's survival. Feature selection is an extremely important step when creating a machine learning solution. If the algorithm does not have good features as input, it will not have good enough material to learn from, and results won't be good, even if we have the best machine learning algorithm ever designed.
Sometimes the feature selection will be done manually, based on our knowledge of the problem's domain and the machine learning method we are planning to use. Sometimes feature selection may be done by using automatic tools to evaluate and select the most promising features. Attributes such as names or ticket numbers, where there is only a small number of instances with each value, present a similar problem (they might not be useful for generalization). We will use class, age, and sex because, a priori, we expect them to have influenced the passenger's survival.
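Purely as an illustration of the automatic route (we will not use it for the Titanic task), scikit-learn includes utilities such as SelectKBest in the sklearn.feature_selection module; the following toy sketch keeps the two features with the highest chi-squared score with respect to the target:

>>> import numpy as np
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> # toy, already numeric data: 6 instances, 3 features, and a binary target
>>> X_toy = np.array([[1, 0, 3], [1, 1, 2], [0, 0, 3], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
>>> y_toy = np.array([1, 1, 1, 0, 0, 0])
>>> # keep the 2 features whose chi-squared statistic with the target is highest
>>> selector = SelectKBest(chi2, k=2)
>>> X_reduced = selector.fit_transform(X_toy, y_toy)
>>> print X_reduced.shape

For the Titanic example, though, we simply stick to our manual choice of class, age, and sex.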
Now, our learning data looks like:
>>> print feature_names
['pclass' 'age' 'sex']
>>> print titanic_X[12], titanic_y[12]
['1st' 'NA' 'female'] 1
We have shown instance number 12 because it poses a problem to solve: one of its features (the age) is not available. We have missing values, a common problem with datasets. In this case, we decided to substitute missing values with the mean age in the training data. We could have taken a different approach, for example, using the most common value in the training data, or the median value. When we substitute missing values, we have to understand that we are modifying the original problem, so we have to be very careful with what we are doing. This is a general rule in machine learning: when we change the data, we should have a clear idea of what we are changing, to avoid skewing the final results.
>>> # We have missing values for age
>>> # Assign the mean value
>>> ages = titanic_X[:, 1]
>>> mean_age = np.mean(titanic_X[ages != 'NA', 1].astype(np.float))
>>> titanic_X[titanic_X[:, 1] == 'NA', 1] = mean_age
The implementation of decision trees in scikit-learn expects as input a list of real-valued features, and the decision rules of the model would be of the form:
Feature <= value
For example, age <= 20.0. Our attributes (except for age) are categorical; that is, they correspond to a value taken from a discrete set, such as male and female. So, we have to convert categorical data into real values. Let's start with the sex feature. The preprocessing module of scikit-learn includes a LabelEncoder class, whose fit method allows conversion of categorical values into integers from 0 to K-1, where K is the number of different classes in the set (in the case of sex, just 0 or 1):
>>> # Encode sex
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> label_encoder = enc.fit(titanic_X[:, 2])
>>> print "Categorical classes:", label_encoder.classes_
Categorical classes: ['female' 'male']
>>> integer_classes = label_encoder.transform(label_encoder.classes_)
>>> print "Integer classes:", integer_classes
Integer classes: [0 1]
>>> t = label_encoder.transform(titanic_X[:, 2])
>>> titanic_X[:, 2] = t
The last two statements transform the values of the sex attribute into 0-1 values, and modify the training set accordingly.
>>> print feature_names
['pclass' 'age' 'sex']
>>> print titanic_X[12], titanic_y[12]
['1st' '31.1941810427' '0'] 1
We still have a categorical attribute: class. We could use the same approach and convert its three classes into 0, 1, and 2. This transformation implicitly introduces an ordering between classes, which in our problem is not an issue. However, we will try a more general approach that does not assume an ordering, and is widely used to convert categorical classes into real-valued attributes. We will introduce an additional encoder and convert the class attribute into three new binary features, each of them indicating if the instance belongs to that feature value (1) or not (0). This is called one-hot encoding, and it is a very common way of managing categorical attributes for methods that expect real-valued features:
>>> from sklearn.preprocessing import OneHotEncoder
>>>
>>> enc = LabelEncoder()
>>> label_encoder = enc.fit(titanic_X[:, 0])
>>> print "Categorical classes:", label_encoder.classes_
Categorical classes: ['1st' '2nd' '3rd']
>>> integer_classes = label_encoder.transform(label_encoder.classes_).reshape(3, 1)
>>> print "Integer classes:", integer_classes
Integer classes: [[0]
 [1]
 [2]]
>>> enc = OneHotEncoder()
>>> one_hot_encoder = enc.fit(integer_classes)
>>> # First, convert classes to 0-(N-1) integers using label_encoder
>>> num_of_rows = titanic_X.shape[0]
>>> t = label_encoder.transform(titanic_X[:, 0]).reshape(num_of_rows, 1)
>>> # Second, create a sparse matrix with three columns, each one indicating if the instance belongs to the class
>>> new_features = one_hot_encoder.transform(t)
>>> # Add the new features to titanic_X
>>> titanic_X = np.concatenate([titanic_X, new_features.toarray()], axis=1)
>>> # Eliminate converted columns
>>> titanic_X = np.delete(titanic_X, [0], 1)
>>> # Update feature names
>>> feature_names = ['age', 'sex', 'first_class', 'second_class', 'third_class']
>>> # Convert to numerical values
>>> titanic_X = titanic_X.astype(float)
>>> titanic_y = titanic_y.astype(float)
The preceding code first converts the classes into integers and then uses the OneHotEncoder class to create the three new attributes, which are added to the array of features. It finally eliminates the original class feature from the training data.
>>> print feature_names
['age', 'sex', 'first_class', 'second_class', 'third_class']
>>> print titanic_X[0], titanic_y[0]
[ 29.   0.   1.   0.   0.] 1.0
We have now a suitable learning set for scikit-learn to learn a decision tree. Also, standardization is not an issue for decision trees because the relative magnitude of features does not affect the classifier performance.
The preprocessing step is usually underestimated in machine learning methods, but as we can see even in this very simple example, it can take some time to make data look as our methods expect. It is also very important in the overall machine learning process; if we fail in this step (for example, incorrectly encoding attributes, or selecting the wrong features), the following steps will fail, no matter how good the method we use for learning.
Training a decision tree classifier
Now to the interesting part; let's build a decision tree from our training data. As usual, we will first separate training and testing data.
>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
>>>     titanic_X, titanic_y, test_size=0.25, random_state=33)
Now we can create a new DecisionTreeClassifier and use the fit method of the classifier to do the learning job.
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
>>> clf = clf.fit(X_train, y_train)
DecisionTreeClassifier accepts (as most learning methods do) several hyperparameters that control its behavior. In this case, we used the Information Gain (IG) criterion for splitting the learning data, told the method to build a tree of at most three levels, and to accept a node as a leaf only if it includes at least five training instances. To explain this and show how decision trees work, let's visualize the model that was built. The following code assumes you are using IPython and that your Python distribution includes the pydot module. It also generates Graphviz code from the tree, and assumes that Graphviz itself is installed. For more information about Graphviz, please refer to http://www.graphviz.org/.
>>> import pydot, StringIO
>>> dot_data = StringIO.StringIO()
>>> tree.export_graphviz(clf, out_file=dot_data,
>>>     feature_names=['age', 'sex', '1st_class', '2nd_class', '3rd_class'])
>>> graph = pydot.graph_from_dot_data(dot_data.getvalue())
>>> graph.write_png('titanic.png')
>>> from IPython.core.display import Image
>>> Image(filename='titanic.png')
The decision tree we have built represents a series of decisions based on the training data. To classify an instance, we should answer the question at each node. For example, at our root node, the question is: Is sex<=0.5? (are we talking about a woman?). If the answer is yes, you go to the left child node in the tree; otherwise you go to the right child node. You keep answering questions (was she in the third class?, was she in the first class?, and was she below 13 years old?), until you reach a leaf. When you are there, the prediction corresponds to the target class that has the most instances among the training instances that gave the same answers to the previous questions. In our case, if she was a woman from second class, the answer would be 1 (that is, she survived), and so on.
You might be asking how our method decides which question should be asked at each step. The answer is Information Gain (IG) (or the Gini index, which is a similar measure of disorder used by scikit-learn). IG measures how much entropy we lose if we answer the question, or alternatively, how much surer we are after answering it. Entropy is a measure of disorder in a set: we have zero entropy if all values are the same (in our case, all instances have the same target class), while it reaches its maximum when there is an equal number of instances of each class (in our case, when half of the instances correspond to survivors and the other half to non-survivors). At each node, we have a certain number of instances (starting from the whole dataset), and we measure their entropy. Our method selects the question that yields the most homogeneous partitions (with the lowest entropy) when we consider only those instances for which the answer is yes or no, that is, the question for which the entropy decreases the most after answering it.
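The following small sketch (on toy data, not our Titanic training set) shows how entropy and information gain can be computed with NumPy; scikit-learn performs this kind of calculation internally when we choose criterion='entropy':

>>> import numpy as np
>>> def entropy(y):
>>>     # Shannon entropy (base-2) of a vector of integer class labels
>>>     p = np.bincount(y).astype(float) / len(y)
>>>     p = p[p > 0]
>>>     return -np.sum(p * np.log2(p))
>>> def information_gain(y, answer_is_yes):
>>>     # entropy before the split minus the weighted entropy of the two branches
>>>     w = float(np.sum(answer_is_yes)) / len(y)
>>>     return entropy(y) - w * entropy(y[answer_is_yes]) - (1 - w) * entropy(y[~answer_is_yes])
>>> # toy target classes and the yes/no answers to a candidate question
>>> y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
>>> answers = np.array([True, True, True, False, False, False, False, False])
>>> print entropy(y_toy), information_gain(y_toy, answers)

At each node, the candidate question with the highest information gain is the one chosen for splitting.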
Interpreting the decision tree
As you can see in the tree, at the beginning of the decision tree growing process, you have the 984 instances in the training set, 662 of them corresponding to class 0 (fatalities), and 322 of them to class 1 (survivors). The measured entropy for this initial group is about 0.632. From the possible list of questions we can ask, the one that produces the greatest information gain is: Was she a woman? (remember that the female category was encoded as 0). If the answer is yes, entropy is almost the same, but if the answer is no, it is greatly reduced (the proportion of men who died was much greater than the general proportion of casualties). In this sense, the woman question seems to be the best to ask. After that, the process continues, working in each node only with the instances that have feature values that correspond to the questions in the path to the node.
If you look at the tree, in each node we have: the question, the initial Shannon entropy, the number of instances we are considering, and their distribution with respect to the target class. In each step, the number of instances gets reduced to those that answer yes (the left branch) and no (the right branch) to the question posed by that node. The process continues until a certain stopping criterion is met (in our case, until we have a fourth-level node, or the number of considered samples is lower than five).
At prediction time, we take an instance and start traversing the tree, answering the questions based on the instance's features, until we reach a leaf. At this point, we look at how many training instances of each class fell into that leaf, and select the class to which most of them belonged.
For example, consider the question of determining whether a 10-year-old girl from first class would have survived. The answer to the first question (was she female?) is yes, so we take the left branch of the tree. In the two following questions the answers are no (was she from third class?) and yes (was she from first class?), so we take the left and right branch respectively. At this time, we have reached a leaf. In the training set, we had 102 people with these attributes, 97 of them survivors. So, our answer would be survived.
In general, we found reasonable results: the group with the most casualties (449 of 496) corresponded to adult men from the second or third class, as you can check in the tree. Most girls from first class, on the other hand, survived. Let's measure the accuracy of our method on the training set (we will first define a helper function to measure the performance of a classifier):
>>> from sklearn import metrics
>>> def measure_performance(X, y, clf, show_accuracy=True,
>>>         show_classification_report=True, show_confusion_matrix=True):
>>>     y_pred = clf.predict(X)
>>>     if show_accuracy:
>>>         print "Accuracy:{0:.3f}".format(
>>>             metrics.accuracy_score(y, y_pred)
>>>         ), "\n"
>>>
>>>     if show_classification_report:
>>>         print "Classification report"
>>>         print metrics.classification_report(y, y_pred), "\n"
>>>
>>>     if show_confusion_matrix:
>>>         print "Confusion matrix"
>>>         print metrics.confusion_matrix(y, y_pred), "\n"

>>> measure_performance(X_train, y_train, clf,
>>>     show_classification_report=False, show_confusion_matrix=False)
Accuracy:0.838
Our tree has an accuracy of 0.838 on the training set. But remember that this is not a good indicator; this is especially true for decision trees, as this method is highly susceptible to overfitting. Since we did not separate an evaluation set, we should apply cross-validation. For this example, we will use an extreme case of cross-validation, named leave-one-out cross-validation. For each instance in the training sample, we train on the rest of the sample and evaluate the model built on the single instance left out. After performing as many classifications as there are training instances, we calculate the accuracy simply as the proportion of times our method correctly predicted the class of the left-out instance; we will find that it is a little lower (as we expected) than the resubstitution accuracy on the training set.
>>> from sklearn.cross_validation import cross_val_score, LeaveOneOut
>>> from scipy.stats import sem
>>>
>>> def loo_cv(X_train, y_train, clf):
>>>     # Perform Leave-One-Out cross validation
>>>     # We are performing 1313 classifications!
>>>     loo = LeaveOneOut(X_train[:].shape[0])
>>>     scores = np.zeros(X_train[:].shape[0])
>>>     for train_index, test_index in loo:
>>>         X_train_cv, X_test_cv = X_train[train_index], X_train[test_index]
>>>         y_train_cv, y_test_cv = y_train[train_index], y_train[test_index]
>>>         clf = clf.fit(X_train_cv, y_train_cv)
>>>         y_pred = clf.predict(X_test_cv)
>>>         scores[test_index] = metrics.accuracy_score(
>>>             y_test_cv.astype(int), y_pred.astype(int))
>>>     print ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))

>>> loo_cv(X_train, y_train, clf)
Mean score: 0.837 (+/-0.012)
The main advantage of leave-one-out cross-validation is that it allows almost as much data for training as we have available, so it is particularly well suited for those cases where data is scarce. Its main problem is that training a different classifier for each instance could be very costly in terms of the computation time.
A big question remains here: how did we select the hyperparameters for our method instantiation? This problem is a general one; it is called model selection, and we will address it in more detail in Chapter 4, Advanced Features.
Random Forests – randomizing decisions
A common criticism of decision trees is that once the training set is divided after answering a question, it is not possible to reconsider this decision. For example, if we divide men and women, every subsequent question will be only about men or only about women, and the method cannot consider another type of question (say, age less than a year, irrespective of the gender). Random Forests try to introduce some level of randomization in each step, proposing alternative trees and combining them to get the final prediction. These types of algorithms, which consider several classifiers answering the same question, are called ensemble methods. In the Titanic task, it is probably hard to see this problem because we have very few features, but consider the case when the number of features is in the order of thousands.
Random Forests propose to build each decision tree on a subset of the training instances (selected randomly, with replacement), considering at each node only a small, random subset of the features. This tree growing process is repeated several times, producing a set of classifiers. At prediction time, each grown tree, given an instance, predicts its target class exactly as decision trees do. The class that most of the trees vote for (that is, the class most often predicted by the trees) is the one suggested by the ensemble classifier.
In scikit-learn, using Random Forests is as simple as importing RandomForestClassifier from the sklearn.ensemble module, and fitting the training data as follows:
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(n_estimators=10, random_state=33)
>>> clf = clf.fit(X_train, y_train)
>>> loo_cv(X_train, y_train, clf)
Mean score: 0.817 (+/-0.012)
We find that results are actually worse for Random Forests. It seems that introducing randomization was, after all, not a good idea here, because the number of features is too small. However, for bigger datasets with a larger number of features, Random Forests are a very fast, simple, and popular method to improve accuracy while retaining the virtues of decision trees. Actually, in the next section, we will use them for regression.
Evaluating the performance
The final step in every supervised learning task should be to evaluate our best classifier on the previously unseen data, to get an idea of its prediction performance. Remember, this step should not be used to select among competing methods or parameters. That would be cheating (because again, we risk overfitting the new data). So, in our case, let's measure the performance of decision trees on the testing data.
>>> clf_dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
>>> clf_dt.fit(X_train, y_train)
>>> measure_performance(X_test, y_test, clf_dt)
Accuracy:0.793

Classification report
             precision    recall  f1-score   support

          0       0.77      0.96      0.85       202
          1       0.88      0.54      0.67       127

avg / total       0.81      0.79      0.78       329

Confusion matrix
[[193   9]
 [ 59  68]]
From the classification results and the confusion matrix, it seems that our method tends to predict too often that the person did not survive.
Predicting house prices with regression
In every example we have seen so far, we have faced what in Chapter 1, Machine Learning – A Gentle Introduction, we called classification problems: the output we aimed to predict belonged to a discrete set. But often we want to predict a value drawn from the real line. The learning schema is still the same: fit a model to the training data and evaluate it on new data, but now the target is a real number. Our model, instead of selecting a class from a list, should act as a real-valued function, which for each of the (possibly infinite) combinations of learning features returns a real number. We could consider regression as classification with an infinite number of target classes.
Many problems can be modeled both as classification and regression tasks, depending on the class we selected as the target. For example, predicting blood sugar level is a regression task, while predicting if somebody has diabetes or not is a classification task.
In the example of the first figure, we used a line to fit the learning data (composed of a single attribute and a target value); that is, we performed linear regression. If we want to predict the value of a new instance, we take its real-valued attribute and obtain the predicted value by projecting the inferred line onto the second axis.
In this section, we will compare several regression methods by using the same dataset. We will try to predict the price of a house as a function of its attributes. As the dataset, we will use the Boston house-prices dataset, which includes 506 instances, representing houses in the suburbs of Boston by 14 features, one of them (the median value of owner-occupied homes) being the target class (for a detailed reference, see http://archive.ics.uci.edu/ml/datasets/Housing). Each attribute in this dataset is real-valued.
The dataset is included in the standard scikit-learn distribution, so let's start by loading it:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> print boston.data.shape
(506, 13)
>>> print boston.feature_names
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT' 'MEDV']
>>> print np.max(boston.target), np.min(boston.target), np.mean(boston.target)
50.0 5.0 22.5328063241
You should try printing boston.DESCR to get a feel for what each feature means. This is a very healthy habit: machine learning is not just number crunching; understanding the problem we are facing is crucial, especially to select the best learning model to use.
As usual, we start slicing our learning set into training and testing datasets, and normalizing the data:
>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
>>>     boston.data, boston.target, test_size=0.25, random_state=33)
>>> from sklearn.preprocessing import StandardScaler
>>> scalerX = StandardScaler().fit(X_train)
>>> scalery = StandardScaler().fit(y_train)
>>> X_train = scalerX.transform(X_train)
>>> y_train = scalery.transform(y_train)
>>> X_test = scalerX.transform(X_test)
>>> y_test = scalery.transform(y_test)
Before looking at our regression models, let's define how we will compare our results. Since we want to preserve our testing set for evaluating the performance of the final model, we should find a way to select the best model while avoiding overfitting. We already know the answer: cross-validation. Regression poses an additional problem: how should we evaluate our results? Accuracy is not a good idea: since we are predicting real values, it is almost impossible for us to exactly predict the final value. There are several measures that can be used (you can look at the list of functions under the sklearn.metrics module). The most common is the R2 score, or coefficient of determination, which measures the proportion of the outcome variation explained by the model, and is the default score function for regression methods in scikit-learn. This score reaches its maximum value of 1 when the model perfectly predicts all the test target values. Using this measure, we will build a function that trains a model and evaluates its performance using five-fold cross-validation and the coefficient of determination.
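Before defining that function, as a quick aside, the R2 score itself can be computed by hand and compared with the value returned by sklearn.metrics.r2_score (the target values below are made up for the illustration):

>>> import numpy as np
>>> from sklearn import metrics
>>> y_true = np.array([3.0, -0.5, 2.0, 7.0])
>>> y_pred = np.array([2.5, 0.0, 2.0, 8.0])
>>> # R2 = 1 - (residual sum of squares / total sum of squares)
>>> ss_res = np.sum((y_true - y_pred) ** 2)
>>> ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
>>> print 1 - ss_res / ss_tot
>>> print metrics.r2_score(y_true, y_pred)

Both lines should print the same value.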
>>> from sklearn.cross_validation import *
>>> def train_and_evaluate(clf, X_train, y_train):
>>>     clf.fit(X_train, y_train)
>>>     print "Coefficient of determination on training set:", clf.score(X_train, y_train)
>>>     # create a k-fold cross validation iterator of k=5 folds
>>>     cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=33)
>>>     scores = cross_val_score(clf, X_train, y_train, cv=cv)
>>>     print "Average coefficient of determination using 5-fold cross-validation:", np.mean(scores)
First try – a linear model
The question that linear models try to answer is which hyperplane in the 14-dimensional space created by our learning features (including the target value) is located closest to them. After this hyperplane is found, prediction reduces to calculating the projection of the new point onto the hyperplane and returning the target value coordinate. Think of our first example in Chapter 1, Machine Learning – A Gentle Introduction, where we wanted to find a line separating our training instances. We could have used that line to predict the second learning attribute as a function of the first one, that is, linear regression.
But what do we mean by closest? The usual measure is least squares: calculate the distance of each instance to the hyperplane, square it (to avoid sign problems), and sum them up. The hyperplane whose sum is smallest is the least squares estimator (in two dimensions, the hyperplane is just a line).
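The following tiny sketch (on made-up one-dimensional data, using NumPy's least squares solver, which is not used elsewhere in this chapter) shows the least squares idea directly:

>>> import numpy as np
>>> # made-up points that roughly follow y = 2*x + 1
>>> x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
>>> y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
>>> # add a column of ones so the model can also learn an intercept
>>> A = np.vstack([x, np.ones(len(x))]).T
>>> # lstsq finds the coefficients that minimize the sum of squared residuals
>>> coeffs, residuals, rank, sv = np.linalg.lstsq(A, y)
>>> print coeffs  # approximately [2.0, 1.0]: the slope and the intercept of the fitted line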
Since we don't know how our data fits (it is difficult to print a 14-dimensional scatter plot!), we will start with a linear model called SGDRegressor, which tries to minimize squared loss.
>>> from sklearn import linear_model
>>> clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
>>> train_and_evaluate(clf_sgd, X_train, y_train)
Coefficient of determination on training set: 0.743303511411
Average coefficient of determination using 5-fold cross-validation: 0.715166411086
We can print the hyperplane coefficients our method has calculated, which is as follows:
>>> print clf_sgd.coef_
[-0.07641527  0.06963738 -0.05935062  0.10878438 -0.06356188  0.37260998
 -0.02912886 -0.20180631  0.08463607 -0.05534634 -0.19521922  0.0653966
 -0.36990842]
You probably noticed the penalty=None parameter when we called the method. The penalization parameter for linear regression methods is introduced to avoid overfitting. It penalizes hyperplanes having some of their coefficients too large, seeking hyperplanes where each feature contributes more or less the same to the predicted value. This parameter is generally the L2 norm (the squared sum of the coefficients) or the L1 norm (the sum of the absolute values of the coefficients). Let's see how our model works if we introduce an L2 penalty.
>>> clf_sgd1 = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', random_state=42)
>>> train_and_evaluate(clf_sgd1, X_train, y_train)
Coefficient of determination on training set: 0.743300616394
Average coefficient of determination using 5-fold cross-validation: 0.715166962417
In this case, we did not obtain an improvement.
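Since the previous paragraph also mentions the L1 norm, we could try it as well; the following lines are a sketch of how that would look (we do not discuss its results here):

>>> clf_sgd2 = linear_model.SGDRegressor(loss='squared_loss', penalty='l1', random_state=42)
>>> train_and_evaluate(clf_sgd2, X_train, y_train)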
Second try – Support Vector Machines for regression
The regression version of SVM can be used instead to find the hyperplane.
>>> from sklearn import svm
>>> clf_svr = svm.SVR(kernel='linear')
>>> train_and_evaluate(clf_svr, X_train, y_train)
Coefficient of determination on training set: 0.71886923342
Average coefficient of determination using 5-fold cross-validation: 0.694983285734
Here, we had no improvement. However, one of the main advantages of SVM is that (using what we called the kernel trick) we can use a nonlinear function, for example, a polynomial function to approximate our data.
>>> clf_svr_poly = svm.SVR(kernel='poly')
>>> train_and_evaluate(clf_svr_poly, X_train, y_train)
Coefficient of determination on training set: 0.904109273301
Average coefficient of determination using 5-fold cross-validation: 0.754993478137
Now, our results are six points better in terms of coefficient of determination. We can actually improve this by using a Radial Basis Function (RBF) kernel.
>>> clf_svr_rbf = svm.SVR(kernel='rbf')
>>> train_and_evaluate(clf_svr_rbf, X_train, y_train)
Coefficient of determination on training set: 0.900132065979
Average coefficient of determination using 5-fold cross-validation: 0.821626135903
RBF kernels have been used in several problems and have shown to be very effective. Actually, RBF is the default kernel used by SVM methods in scikit-learn.
Third try – Random Forests revisited
We can try a very different approach to regression using Random Forests. We have previously used Random Forests for classification. When used for regression, the tree growing procedure is exactly the same, but at prediction time, when we arrive at a leaf, instead of reporting the majority class, we return a representative real value, for example, the average of the target values.
Actually, we will use Extra Trees, implemented in the ExtraTreesRegressor class within the sklearn.ensemble module. This method adds an extra level of randomization. It not only selects for each tree a different, random subset of features, but also randomly selects the threshold for each decision.
>>> from sklearn import ensemble
>>> clf_et = ensemble.ExtraTreesRegressor(n_estimators=10, compute_importances=True, random_state=42)
>>> train_and_evaluate(clf_et, X_train, y_train)
Coefficient of determination on training set: 1.0
Average coefficient of determination using 5-fold cross-validation: 0.852511952001
The first thing to note is that we have not only completely eliminated underfitting (achieving perfect prediction on training values), but also improved the performance by three points while using cross-validation. An interesting feature of Extra Trees is that they allow computing the importance of each feature for the regression task. Let's compute this importance as follows:
>>> print sort(zip(clf_et.feature_importances_, boston.feature_names), axis=0)
[['0.000231085384564' 'AGE']
 ['0.000909210196652' 'B']
 ['0.00162702734638' 'CHAS']
 ['0.00292361527201' 'CRIM']
 ['0.00472492264278' 'DIS']
 ['0.00489022243822' 'INDUS']
 ['0.0067481487587' 'LSTAT']
 ['0.00852353178943' 'NOX']
 ['0.00873406149286' 'PTRATIO']
 ['0.0366902590312' 'RAD']
 ['0.0982265323415' 'RM']
 ['0.385904111089' 'TAX']
 ['0.439867272217' 'ZN']]
We can see that ZN (proportion of residential land zoned for lots over 25,000 sq. ft.) and TAX (full-value property tax rate) are by far the most influential features in our final decision.
Evaluation
As usual, let's evaluate the performance of our best method on the testing set (we have slightly modified our measure_performance function to show the coefficient of determination):
>>> from sklearn import metrics
>>> def measure_performance(X, y, clf, show_accuracy=True,
>>>         show_classification_report=True, show_confusion_matrix=True,
>>>         show_r2_score=False):
>>>     y_pred = clf.predict(X)
>>>     if show_accuracy:
>>>         print "Accuracy:{0:.3f}".format(
>>>             metrics.accuracy_score(y, y_pred)
>>>         ), "\n"
>>>
>>>     if show_classification_report:
>>>         print "Classification report"
>>>         print metrics.classification_report(y, y_pred), "\n"
>>>
>>>     if show_confusion_matrix:
>>>         print "Confusion matrix"
>>>         print metrics.confusion_matrix(y, y_pred), "\n"
>>>
>>>     if show_r2_score:
>>>         print "Coefficient of determination:{0:.3f}".format(
>>>             metrics.r2_score(y, y_pred)
>>>         ), "\n"

>>> measure_performance(X_test, y_test, clf_et,
>>>     show_accuracy=False, show_classification_report=False,
>>>     show_confusion_matrix=False, show_r2_score=True)
Coefficient of determination:0.793
Once we have selected our best method, we could train it on all the available data, but then we would have no way to measure its performance on future data, simply because we would not have any data left to evaluate on.
Summary
In this chapter we reviewed some of the most common supervised learning methods and several practical applications. We learned that supervised methods require instances to have both input features and a target class. In the next chapter, we will review unsupervised learning methods, which do not require a target class. These methods are very useful for understanding the structure of the data, and can also be used as a preliminary step before applying a supervised learning model.