The logistic regression model
We've already discussed the theory behind logistic regression, so we can begin fitting our models. An R installation comes with the glm() function fitting the generalized linear models, which are a class of models that includes logistic regression. The code syntax is similar to the lm() function that we used in the previous chapter. One big difference is that we must use the family = binomial argument in the function, which tells R to run a logistic regression method instead of the other versions of the generalized linear models. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set, as follows:
> full.fit <- glm(class ~ ., family = binomial,
data = train)
> summary(full.fit)
Call:
glm(formula = class ~ ., family = binomial, data =
train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3397 -0.1387 -0.0716 0.0321 2.3559
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.4293 1.2273 -7.683 1.55e-14
***
thick 0.5252 0.1601 3.280 0.001039 **
u.size -0.1045 0.2446 -0.427 0.669165
u.shape 0.2798 0.2526 1.108 0.268044
adhsn 0.3086 0.1738 1.776 0.075722 .
s.size 0.2866 0.2074 1.382 0.167021
nucl 0.4057 0.1213 3.344 0.000826
***
chrom 0.2737 0.2174 1.259 0.208006
n.nuc 0.2244 0.1373 1.635 0.102126
mit 0.4296 0.3393 1.266 0.205402
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
'.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to
be 1)
Null deviance: 620.989 on 473 degrees of
freedom
Residual deviance: 78.373 on 464 degrees of
freedom
AIC: 98.373
Number of Fisher Scoring iterations: 8
The summary() function allows us to inspect the coefficients and their p-values. We can see that only two features have p-values less than 0.05 (thickness and nuclei). An examination of the 95 percent confidence intervals can be called on with the confint() function, as follows:
> confint(full.fit)
2.5 % 97.5 %
(Intercept) -12.23786660 -7.3421509
thick 0.23250518 0.8712407
u.size -0.56108960 0.4212527
u.shape -0.24551513 0.7725505
adhsn -0.02257952 0.6760586
s.size -0.11769714 0.7024139
nucl 0.17687420 0.6582354
chrom -0.13992177 0.7232904
n.nuc -0.03813490 0.5110293
mit -0.14099177 1.0142786
Note that the two significant features have confidence intervals that do not cross zero. You cannot translate the coefficients in logistic regression as the change in Y is based on a one-unit change in X. This is where the odds ratio can be quite helpful. The beta coefficients from the log function can be converted to odds ratios with an exponent (beta).
In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:
> exp(coef(full.fit))
(Intercept) thick u.size u.shape
adhsn
8.033466e-05 1.690879e+00 9.007478e-01 1.322844e+00
1.361533e+00
s.size nucl chrom n.nuc mit
1.331940e+00 1.500309e+00 1.314783e+00 1.251551e+00
1.536709e+00
The interpretation of an odds ratio is the change in the outcome odds resulting from a unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease. In this example, all the features except u.size will increase the log odds.
One of the issues pointed out during data exploration was the potential issue of multi-collinearity. It is possible to produce the VIF statistics that we did in linear regression with a logistic model in the following way:
> library(car)
> vif(full.fit)
thick u.size u.shape adhsn s.size nucl
chrom n.nuc
1.2352 3.2488 2.8303 1.3021 1.6356 1.3729
1.5234 1.3431
mit
1.059707
None of the values are greater than the VIF rule of thumb statistic of five, so collinearity does not seem to be a problem. Feature selection will be the next task; but, for now, let's produce some code to look at how well this model does on both the train and test sets.
You will first have to create a vector of the predicted probabilities, as follows:
> train.probs <- predict(full.fit, type =
"response")
> train.probs[1:5] #inspect the first 5 predicted
probabilities
[1] 0.02052820 0.01087838 0.99992668 0.08987453
0.01379266
Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix. In later chapters, we will examine the version provided by the caret package. There is also a version provided in the InformationValue package. This is where we will need the outcome as 0's and 1's. The default value by which the function selects either benign or malignant is 0.50, which is to say that any probability at or above 0.50 is classified as malignant:
> trainY <- y[ind==1]
> testY <- y[ind==2]
> confusionMatrix(trainY, train.probs)
0 1
0 294 7
1 8 165
The rows signify the predictions, and the columns signify the actual values. The diagonal elements are the correct classifications. The top right value, 7, is the number of false negatives, and the bottom left value, 8, is the number of false positives. We can also take a look at the error rate, as follows:
> misClassError(trainY, train.probs)
[1] 0.0316
It seems we have done a fairly good job with only a 3.16% error rate on the training set. As we previously discussed, we must be able to accurately predict unseen data, in other words, our test set.
The method to create a confusion matrix for the test set is similar to how we did it for the training data:
> test.probs <- predict(full.fit, newdata = test,
type = "response")
> misClassError(testY, test.probs)
[1] 0.0239
> confusionMatrix(testY, test.probs)
0 1
0 139 2
1 3 65
It looks like we have done pretty well in creating a model with all the features. The roughly 98 percent prediction rate is quite impressive. However, we must still check whether there is room for improvement. Imagine that you or your loved one is a patient who has been diagnosed incorrectly. As previously mentioned, the implications can be quite dramatic. With this in mind, is there perhaps a better way to create a classification algorithm? Let's have a look!